Class: Polars::LazyFrame
Inherits: Object
Defined in: lib/polars/lazy_frame.rb
Overview
Representation of a Lazy computation graph/query against a DataFrame.
Class Method Summary collapse
-
.deserialize(source, format: "binary") ⇒ LazyFrame
Read a logical plan from a file to construct a LazyFrame.
Instance Method Summary collapse
-
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k smallest rows.
-
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ LazyFrame
Create an empty copy of the current LazyFrame.
-
#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Materialize this LazyFrame into a DataFrame.
-
#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Evaluate the query in streaming mode and get a generator that returns chunks.
-
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
-
#columns ⇒ Array
Get column names.
-
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
-
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
-
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String
Create a string representation of the query plan.
-
#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame
Explode lists to long format.
-
#fill_nan(value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(*predicates, **constraints) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#gather_every(n, offset: 0) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
-
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
Perform a join based on one or multiple (in)equality predicates.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
-
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#map_batches(predicate_pushdown: true, projection_pushdown: true, slice_pushdown: true, no_optimizations: false, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame
Apply a custom function.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
-
#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_") ⇒ LazyFrame
Create a spreadsheet-style pivot table as a DataFrame.
-
#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array
Profile a LazyFrame.
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyFrame
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
-
#serialize(file = nil, format: "binary") ⇒ Object
Serialize the logical plan of this LazyFrame to a file or string.
-
#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame
Flag a column as sorted.
-
#shift(n = 1, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Show a plot of the query plan.
-
#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Persists a LazyFrame at the provided path.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
-
#sql(query, table_name: "self") ⇒ Expr
Execute a SQL query against the LazyFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
-
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k largest rows.
-
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(columns, *more_columns, separator: nil) ⇒ LazyFrame
Decompose a struct into its fields.
-
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
Update the values in this LazyFrame with the values in other.
-
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
-
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
Add a column at index 0 that counts the rows.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame
Create a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 8

def initialize(
  data = nil,
  schema: nil,
  schema_overrides: nil,
  strict: true,
  orient: nil,
  infer_schema_length: N_INFER_DEFAULT,
  nan_to_null: false,
  height: nil
)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      strict: strict,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null,
      height: height
    )
    .lazy
    ._ldf
  )
end
Class Method Details
.deserialize(source, format: "binary") ⇒ LazyFrame
This function uses marshaling if the logical plan contains Ruby UDFs, and as such inherits Marshal's security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Read a logical plan from a file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 76

def self.deserialize(source, format: "binary")
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  if format == "binary"
    raise Todo unless RbLazyFrame.respond_to?(:deserialize_binary)
    deserializer = RbLazyFrame.method(:deserialize_binary)
  elsif format == "json"
    deserializer = RbLazyFrame.method(:deserialize_json)
  else
    msg = "`format` must be one of {{'binary', 'json'}}, got #{format.inspect}"
    raise ArgumentError, msg
  end

  _from_rbldf(deserializer.(source))
end
Instance Method Details
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of
the value of reverse. The output is not guaranteed to be in any
particular order; call sort after this function if you wish the
output to be sorted.
# File 'lib/polars/lazy_frame.rb', line 849

def bottom_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.bottom_k(k, by, reverse))
end
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
# File 'lib/polars/lazy_frame.rb', line 1668

def cache
  _from_rbldf(_ldf.cache)
end
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
# File 'lib/polars/lazy_frame.rb', line 1721

def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end
#clear(n = 0) ⇒ LazyFrame
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
# File 'lib/polars/lazy_frame.rb', line 1773

def clear(n = 0)
  DataFrame.new(schema: schema).clear(n).lazy
end
#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Materialize this LazyFrame into a DataFrame.
By default, all query optimizations are enabled. Individual optimizations may
be disabled by setting the corresponding parameter to false.
# File 'lib/polars/lazy_frame.rb', line 966

def collect(
  engine: "auto",
  background: false,
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)
  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  if background
    Utils.issue_unstable_warning("background mode is considered unstable.")
    return InProcessQuery.new(ldf.collect_concurrently)
  end

  Utils.wrap_df(ldf.collect(engine))
end
#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.
Evaluate the query in streaming mode and get a generator that returns chunks.
This allows streaming results that are larger than RAM to be written to disk.
The query will always be fully executed unless stop is called, so you should
call next until all chunks have been seen.
# File 'lib/polars/lazy_frame.rb', line 1628

def collect_batches(
  chunk_size: nil,
  maintain_order: true,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  inner = ldf.collect_batches(
    engine,
    maintain_order,
    chunk_size,
    lazy
  )
  CollectBatches.new(inner)
end
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 1017

def collect_schema
  Schema.new(_ldf.collect_schema, check_dtypes: false)
end
#columns ⇒ Array
Get or set column names.
# File 'lib/polars/lazy_frame.rb', line 112

def columns
  _ldf.collect_schema.keys
end
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
# File 'lib/polars/lazy_frame.rb', line 5125

def count
  _from_rbldf(_ldf.count)
end
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
The median is included by default as the 50% percentile.
This method does not maintain the laziness of the frame, and will collect
the final result. This could potentially be an expensive operation.
We do not guarantee the output of describe to be stable. It will show
statistics that we deem informative, and may be updated in the future.
Using describe programmatically (versus interactive exploration) is
not recommended for this reason.
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 341

def describe(
  percentiles: [0.25, 0.5, 0.75],
  interpolation: "nearest"
)
  schema = collect_schema.to_h
  if schema.empty?
    msg = "cannot describe a LazyFrame that has no columns"
    raise TypeError, msg
  end

  # create list of metrics
  metrics = ["count", "null_count", "mean", "std", "min"]
  if (quantiles = Utils.parse_percentiles(percentiles)).any?
    metrics.concat(quantiles.map { |q| "%g%%" % [q * 100] })
  end
  metrics.append("max")

  skip_minmax = lambda do |dt|
    dt.nested? || [Categorical, Enum, Null, Object, Unknown].include?(dt)
  end

  # determine which columns will produce std/mean/percentile/etc
  # statistics in a single pass over the frame
  has_numeric_result, sort_cols = Set.new, Set.new
  metric_exprs = []
  null = F.lit(nil)
  schema.each do |c, dtype|
    is_numeric = dtype.numeric?
    is_temporal = !is_numeric && dtype.temporal?

    # counts
    count_exprs = [
      F.col(c).count.name.prefix("count:"),
      F.col(c).null_count.name.prefix("null_count:")
    ]

    # mean
    mean_expr =
      if is_temporal || is_numeric || dtype == Boolean
        F.col(c).mean
      else
        null
      end

    # standard deviation, min, max
    expr_std = is_numeric ? F.col(c).std : null
    min_expr = !skip_minmax.(dtype) ? F.col(c).min : null
    max_expr = !skip_minmax.(dtype) ? F.col(c).max : null

    # percentiles
    pct_exprs = []
    quantiles.each do |p|
      if is_numeric || is_temporal
        pct_expr =
          if is_temporal
            F.col(c).to_physical.quantile(p, interpolation: interpolation).cast(dtype)
          else
            F.col(c).quantile(p, interpolation: interpolation)
          end
        sort_cols.add(c)
      else
        pct_expr = null
      end
      pct_exprs << pct_expr.alias("#{p}:#{c}")
    end

    if is_numeric || dtype.nested? || [Null, Boolean].include?(dtype)
      has_numeric_result.add(c)
    end

    # add column expressions (in end-state 'metrics' list order)
    metric_exprs.concat(
      [
        *count_exprs,
        mean_expr.alias("mean:#{c}"),
        expr_std.alias("std:#{c}"),
        min_expr.alias("min:#{c}"),
        *pct_exprs,
        max_expr.alias("max:#{c}")
      ]
    )
  end

  # calculate requested metrics in parallel, then collect the result
  df_metrics = (
    (
      # if more than one quantile, sort the relevant columns to make them O(1)
      # TODO: drop sort once we have efficient retrieval of multiple quantiles
      sort_cols ? with_columns(sort_cols.map { |c| F.col(c).sort }) : self
    )
    .select(*metric_exprs)
    .collect
  )

  # reshape wide result
  n_metrics = metrics.length
  column_metrics = schema.length.times.map do |n|
    df_metrics.row(0)[(n * n_metrics)...((n + 1) * n_metrics)]
  end
  summary = schema.keys.zip(column_metrics).to_h

  # cast by column type (numeric/bool -> float), (other -> string)
  schema.each_key do |c|
    summary[c] = summary[c].map do |v|
      if v.nil? || v.is_a?(Hash)
        nil
      else
        if has_numeric_result.include?(c)
          if v == true
            1.0
          elsif v == false
            0.0
          else
            v.to_f
          end
        else
          "#{v}"
        end
      end
    end
  end

  # return results as a DataFrame
  df_summary = Polars.from_hash(summary)
  df_summary.insert_column(0, Polars::Series.new("statistic", metrics))
  df_summary
end
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3369

def drop(*columns, strict: true)
  selectors = []
  columns.each do |c|
    if c.is_a?(Enumerable)
      selectors += c
    else
      selectors += [c]
    end
  end
  drop_cols = Utils.parse_list_into_selector(selectors, strict: strict)
  _from_rbldf(_ldf.drop(drop_cols._rbselector))
end
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
# File 'lib/polars/lazy_frame.rb', line 4314

def drop_nans(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nans(selector_subset))
end
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
# File 'lib/polars/lazy_frame.rb', line 4363

def drop_nulls(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nulls(selector_subset))
end
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 130

def dtypes
  _ldf.collect_schema.values
end
#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String
Create a string representation of the query plan.
Different optimizations can be turned on or off.
# File 'lib/polars/lazy_frame.rb', line 490

def explain(
  format: "plain",
  optimized: true,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)
  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  if optimized
    ldf = _ldf.with_optimizations(optimizations._rboptflags)
    if format == "tree"
      return ldf.describe_optimized_plan_tree
    else
      return ldf.describe_optimized_plan
    end
  end

  if format == "tree"
    _ldf.describe_plan_tree
  else
    _ldf.describe_plan
  end
end
#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame
Explode lists to long format.
# File 'lib/polars/lazy_frame.rb', line 4191

def explode(
  columns,
  *more_columns,
  empty_as_null: true,
  keep_nulls: true
)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.explode(subset._rbselector, empty_as_null, keep_nulls))
end
#fill_nan(value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) values are not missing values!
To replace missing values, use fill_null instead.
Fill floating point NaN values.
# File 'lib/polars/lazy_frame.rb', line 3930

def fill_nan(value)
  if !value.is_a?(Expr)
    value = F.lit(value)
  end
  _from_rbldf(_ldf.fill_nan(value._rbexpr))
end
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/lazy_frame.rb', line 3855

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  if !value.nil?
    if value.is_a?(Expr)
      dtypes = nil
    elsif value.is_a?(TrueClass) || value.is_a?(FalseClass)
      dtypes = [Boolean]
    elsif matches_supertype && (value.is_a?(Integer) || value.is_a?(Float))
      dtypes = [
        Int8, Int16, Int32, Int64, Int128,
        UInt8, UInt16, UInt32, UInt64,
        Float32, Float64, Decimal.new
      ]
    elsif value.is_a?(Integer)
      dtypes = [Int64]
    elsif value.is_a?(Float)
      dtypes = [Float64]
    elsif value.is_a?(::Date)
      dtypes = [Date]
    elsif value.is_a?(::String)
      dtypes = [String, Categorical]
    else
      # fallback; anything not explicitly handled above
      dtypes = nil
    end

    if dtypes
      return with_columns(
        F.col(dtypes).fill_null(value, strategy: strategy, limit: limit)
      )
    end
  end

  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
#filter(*predicates, **constraints) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/lazy_frame.rb', line 1892

def filter(*predicates, **constraints)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return dup
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return clear
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: false
  )
end
#first ⇒ LazyFrame
Get the first row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3720

def first
  slice(0, 1)
end
#gather_every(n, offset: 0) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 3782

def gather_every(n, offset: 0)
  select(F.col("*").gather_every(n, offset))
end
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
Start a group by operation.
# File 'lib/polars/lazy_frame.rb', line 2196

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by is that a row can be member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with
the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 2559

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  group_by: nil,
  start_by: "window"
)
  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = group_by.nil? ? [] : Utils.parse_into_list_of_expressions(group_by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
# File 'lib/polars/lazy_frame.rb', line 3625

def head(n = 5)
  slice(0, n)
end
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
# File 'lib/polars/lazy_frame.rb', line 167

def include?(key)
  columns.include?(key)
end
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/lazy_frame.rb', line 4718

def interpolate
  select(F.col("*").interpolate)
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
# File 'lib/polars/lazy_frame.rb', line 3070

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  nulls_equal: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil,
  maintain_order: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if maintain_order.nil?
    maintain_order = "none"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf,
        [],
        [],
        allow_parallel,
        nulls_equal,
        force_parallel,
        how,
        suffix,
        validate,
        maintain_order,
        coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      nulls_equal,
      how,
      suffix,
      validate,
      maintain_order,
      coalesce
    )
  )
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/lazy_frame.rb', line 2857

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce,
      allow_exact_matches,
      check_sortedness
    )
  )
end
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
# File 'lib/polars/lazy_frame.rb', line 3219 def join_where( other, *predicates, suffix: "_right" ) Utils.require_same_type(self, other) rbexprs = Utils.parse_into_list_of_expressions(*predicates) _from_rbldf( _ldf.join_where( other._ldf, rbexprs, suffix ) ) end |
#last ⇒ LazyFrame
Get the last row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3695 def last tail(1) end |
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or
LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 1661 def lazy self end |
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
Alias for #head.
# File 'lib/polars/lazy_frame.rb', line 3580 def limit(n = 5) head(n) end |
#map_batches(predicate_pushdown: true, projection_pushdown: true, slice_pushdown: true, no_optimizations: false, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame
The schema of a LazyFrame must always be correct. It is up to the caller
of this function to ensure that this invariant is upheld.
It is important that the optimization flags are correct. If the custom function
for instance does an aggregation of a column, predicate_pushdown should not
be allowed, as this prunes rows and will influence your aggregation results.
A UDF passed to map_batches must be pure, meaning that it cannot modify or
depend on state other than its arguments.
Apply a custom function.
It is important that the function returns a Polars DataFrame.
# File 'lib/polars/lazy_frame.rb', line 4662 def map_batches( predicate_pushdown: true, projection_pushdown: true, slice_pushdown: true, no_optimizations: false, schema: nil, validate_output_schema: true, streamable: false, &function ) raise Todo if !schema.nil? if no_optimizations predicate_pushdown = false projection_pushdown = false slice_pushdown = false end _from_rbldf( _ldf.map_batches( function, predicate_pushdown, projection_pushdown, slice_pushdown, streamable, schema, validate_output_schema, ) ) end |
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
# File 'lib/polars/lazy_frame.rb', line 4017 def max _from_rbldf(_ldf.max) end |
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
# File 'lib/polars/lazy_frame.rb', line 4077 def mean _from_rbldf(_ldf.mean) end |
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
# File 'lib/polars/lazy_frame.rb', line 4097 def median _from_rbldf(_ldf.median) end |
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
# File 'lib/polars/lazy_frame.rb', line 4823 def merge_sorted(other, key) _from_rbldf(_ldf.merge_sorted(other._ldf, key)) end |
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
# File 'lib/polars/lazy_frame.rb', line 4037 def min _from_rbldf(_ldf.min) end |
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
# File 'lib/polars/lazy_frame.rb', line 4123 def null_count _from_rbldf(_ldf.null_count) end |
#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/lazy_frame.rb', line 261 def pipe(function, *args, **kwargs, &block) function.(self, *args, **kwargs, &block) end |
#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_") ⇒ LazyFrame
In some other frameworks, you might know this operation as pivot_wider.
Create a spreadsheet-style pivot table as a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 4433 def pivot( on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_" ) if index.nil? && values.nil? msg = "`pivot` needs either `index` or `values` to be specified" raise InvalidOperationError, msg end on_selector = Utils.parse_list_into_selector(on) if !values.nil? values_selector = Utils.parse_list_into_selector(values) end if !index.nil? index_selector = Utils.parse_list_into_selector(index) end if values.nil? values_selector = Polars.cs.all - on_selector - index_selector end if index.nil? index_selector = Polars.cs.all - on_selector - values_selector end agg = F.element if aggregate_function.is_a?(::String) if aggregate_function == "first" agg = agg.first elsif aggregate_function == "item" agg = agg.item elsif aggregate_function == "sum" agg = agg.sum elsif aggregate_function == "max" agg = agg.max elsif aggregate_function == "min" agg = agg.min elsif aggregate_function == "mean" agg = agg.mean elsif aggregate_function == "median" agg = agg.median elsif aggregate_function == "last" agg = agg.last elsif aggregate_function == "len" agg = agg.len elsif aggregate_function == "count" Utils.issue_deprecation_warning( "`aggregate_function='count'` input for `pivot` is deprecated." + " Please use `aggregate_function='len'`." ) agg = agg.len else msg = "invalid input for `aggregate_function` argument: #{aggregate_function.inspect}" raise ArgumentError, msg end elsif aggregate_function.nil? agg = agg.item(allow_empty: true) else agg = aggregate_function end if on_columns.is_a?(DataFrame) on_cols = on_columns elsif on_columns.is_a?(Series) on_cols = on_columns.to_frame else on_cols = Series.new(on_columns).to_frame end _from_rbldf( _ldf.pivot( on_selector._rbselector, on_cols._df, index_selector._rbselector, values_selector._rbselector, agg._rbexpr, maintain_order, separator ) ) end |
#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array
Profile a LazyFrame.
This will run the query and return an array containing the materialized DataFrame and a DataFrame that contains profiling information for each node that is executed.
The units of the timings are microseconds.
# File 'lib/polars/lazy_frame.rb', line 911 def profile( engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) ldf = _ldf.with_optimizations(optimizations._rboptflags) df_rb, timings_rb = ldf.profile [Utils.wrap_df(df_rb), Utils.wrap_df(timings_rb)] end |
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
# File 'lib/polars/lazy_frame.rb', line 4148 def quantile(quantile, interpolation: "nearest") quantile = Utils.parse_into_expression(quantile, str_as_lit: false) _from_rbldf(_ldf.quantile(quantile, interpolation)) end |
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to true are retained
(this includes rows where the predicate evaluates as null).
# File 'lib/polars/lazy_frame.rb', line 2023 def remove( *predicates, **constraints ) if constraints.empty? # early-exit conditions (exclude/include all rows) if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass)) return clear end if predicates.length == 1 && predicates[0].is_a?(FalseClass) return dup end end _filter( predicates: predicates, constraints: constraints, invert: true ) end |
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
# File 'lib/polars/lazy_frame.rb', line 3414 def rename(mapping, strict: true) if mapping.respond_to?(:call) select(F.all.name.map(&mapping)) else existing = mapping.keys _new = mapping.values _from_rbldf(_ldf.rename(existing, _new, strict)) end end |
#reverse ⇒ LazyFrame
Reverse the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3447 def reverse _from_rbldf(_ldf.reverse) end |
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyFrame
Create rolling groups based on a time column.
Unlike a dynamic group by, the windows are determined by the individual values rather than being of constant intervals. For constant intervals use group_by_dynamic.
The period and offset arguments are created either from a timedelta, or
by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a rolling operation on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 2284 def rolling( index_column:, period:, offset: nil, closed: "right", group_by: nil ) index_column = Utils.parse_into_expression(index_column) if offset.nil? offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period)) end rbexprs_by = ( !group_by.nil? ? Utils.parse_into_list_of_expressions(group_by) : [] ) period = Utils.parse_as_duration_string(period) offset = Utils.parse_as_duration_string(offset) lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by) LazyGroupBy.new(lgb) end |
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/lazy_frame.rb', line 148 def schema _ldf.collect_schema end |
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2132 def select(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0" rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select(rbexprs)) end |
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/lazy_frame.rb', line 2155 def select_seq(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0 rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select_seq(rbexprs)) end |
#serialize(file = nil, format: "binary") ⇒ Object
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Serialize the logical plan of this LazyFrame to a file or string.
# File 'lib/polars/lazy_frame.rb', line 214 def serialize(file = nil, format: "binary") if format == "binary" raise Todo unless _ldf.respond_to?(:serialize_binary) serializer = _ldf.method(:serialize_binary) elsif format == "json" msg = "'json' serialization format of LazyFrame is deprecated" warn msg serializer = _ldf.method(:serialize_json) else msg = "`format` must be one of {'binary', 'json'}, got #{format.inspect}" raise ArgumentError, msg end Utils.serialize_polars_object(serializer, file) end |
#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
# File 'lib/polars/lazy_frame.rb', line 4844 def set_sorted( column, *more_columns, descending: false, nulls_last: false ) if !Utils.strlike?(column) msg = "expected a 'str' for argument 'column' in 'set_sorted'" raise TypeError, msg end if Utils.bool?(descending) ds = [descending] else ds = descending end if Utils.bool?(nulls_last) nl = [nulls_last] else nl = nulls_last end _from_rbldf( _ldf.hint_sorted( [column] + more_columns, ds, nl ) ) end |
#shift(n = 1, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
# File 'lib/polars/lazy_frame.rb', line 3493 def shift(n = 1, fill_value: nil) if !fill_value.nil? fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true) end n = Utils.parse_into_expression(n) _from_rbldf(_ldf.shift(n, fill_value)) end |
#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Show a plot of the query plan.
Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download).
# File 'lib/polars/lazy_frame.rb', line 559 def show_graph( optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if engine == "streaming" issue_unstable_warning("streaming mode is considered unstable.") end optimizations = optimizations.dup optimizations._rboptflags.streaming = engine == "streaming" _ldf = self._ldf.with_optimizations(optimizations._rboptflags) if plan_stage == "ir" dot = _ldf.to_dot(optimized) elsif plan_stage == "physical" if engine == "streaming" dot = _ldf.to_dot_streaming_phys(optimized) else dot = _ldf.to_dot(optimized) end else error_msg = "invalid plan stage '#{plan_stage}'" raise TypeError, error_msg end Utils.display_dot_graph( dot: dot, show: show, output_path: output_path, raw_output: raw_output ) end |
#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1398 def sink_csv( path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) Utils._check_arg_is_1byte("separator", separator, false) Utils._check_arg_is_1byte("quote_char", quote_char, false) engine = _select_engine(engine) _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, storage_options, "sink_csv" ) target = _to_sink_target(path) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) storage_options = storage_options || {} storage_options["max_retries"] = retries end sink_options = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: storage_options, credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_csv( target, sink_options, include_bom, compression, compression_level, check_extension, include_header, separator.ord, line_terminator, quote_char.ord, batch_size, datetime_format, date_format, time_format, float_scientific, float_precision, decimal_comma, null_value, quote_style ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1233 def sink_ipc( path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false ) engine = _select_engine(engine) _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) storage_options = storage_options || {} storage_options["max_retries"] = retries end credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, storage_options, "sink_ipc" ) target = _to_sink_target(path) compat_level_rb = nil if compat_level.nil? compat_level_rb = true else raise Todo end if compression.nil? compression = "uncompressed" end sink_options = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: storage_options, credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_ipc( target, sink_options, compression, compat_level_rb, record_batch_size, _record_batch_statistics ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1537 def sink_ndjson( path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) storage_options = storage_options || {} storage_options["max_retries"] = retries end _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, storage_options, "sink_ndjson" ) target = _to_sink_target(path) sink_options = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: storage_options, credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_ndjson( target, compression, compression_level, check_extension, sink_options ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to a Parquet file at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1095 def sink_parquet( path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if statistics == true statistics = { min: true, max: true, distinct_count: false, null_count: true } elsif statistics == false statistics = {} elsif statistics == "full" statistics = { min: true, max: true, distinct_count: true, null_count: true } end _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) storage_options = storage_options || {} storage_options["max_retries"] = retries end credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, storage_options, "sink_parquet" ) target = _to_sink_target(path) sink_options = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: storage_options, credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_parquet( target, sink_options, compression, compression_level, statistics, row_group_size, data_page_size, metadata, arrow_schema ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3530 def slice(offset, length = nil) if length && length < 0 raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame" end _from_rbldf(_ldf.slice(offset, length)) end |
#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
# File 'lib/polars/lazy_frame.rb', line 645 def sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) if by.is_a?(::String) && more_by.empty? return _from_rbldf( _ldf.sort( by, descending, nulls_last, maintain_order, multithreaded ) ) end by = Utils.parse_into_list_of_expressions(by, *more_by) descending = Utils.extend_bool(descending, by.length, "descending", "by") nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by") _from_rbldf( _ldf.sort_by_exprs( by, descending, nulls_last, maintain_order, multithreaded ) ) end |
#sql(query, table_name: "self") ⇒ LazyFrame
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level Polars.sql.
- More control over registration and execution behaviour is available by using the SQLContext object.
Execute a SQL query against the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 724 def sql(query, table_name: "self") ctx = Polars::SQLContext.new name = table_name || "self" ctx.register(name, self) ctx.execute(query) end |
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
# File 'lib/polars/lazy_frame.rb', line 3965 def std(ddof: 1) _from_rbldf(_ldf.std(ddof)) end |
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
# File 'lib/polars/lazy_frame.rb', line 4057 def sum _from_rbldf(_ldf.sum) end |
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
# File 'lib/polars/lazy_frame.rb', line 3670 def tail(n = 5) _from_rbldf(_ldf.tail(n)) end |
#to_s ⇒ String
Returns a string representing the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 176
def to_s
  <<~EOS
    naive plan: (run LazyFrame#explain(optimized: true) to see the optimized plan)

    #{explain(optimized: false)}
  EOS
end
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call #sort after this function if you wish the output to be sorted.
# File 'lib/polars/lazy_frame.rb', line 785 def top_k( k, by:, reverse: false ) by = Utils.parse_into_list_of_expressions(by) reverse = Utils.extend_bool(reverse, by.length, "reverse", "by") _from_rbldf(_ldf.top_k(k, by, reverse)) end |
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or
subset.
# File 'lib/polars/lazy_frame.rb', line 4264 def unique(maintain_order: false, subset: nil, keep: "any") parsed_subset = nil if !subset.nil? parsed_subset = Utils.parse_into_list_of_expressions(subset, __require_selectors: true) end _from_rbldf(_ldf.unique(maintain_order, parsed_subset, keep)) end |
#unnest(columns, *more_columns, separator: nil) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct column.
# File 'lib/polars/lazy_frame.rb', line 4778 def unnest(columns, *more_columns, separator: nil) subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector( more_columns ) _from_rbldf(_ldf.unnest(subset._rbselector, separator)) end |
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/lazy_frame.rb', line 4567 def unpivot( on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true ) if !streamable warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated" end selector_on = on.nil? ? Selectors.empty : Utils.parse_list_into_selector(on) selector_index = index.nil? ? Selectors.empty : Utils.parse_list_into_selector(index) _from_rbldf( _ldf.unpivot( selector_on._rbselector, selector_index._rbselector, value_name, variable_name ) ) end |
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order
of the left DataFrame by default, with an optional coalesce when
include_nulls: false.
Update the values in this LazyFrame with the values in other.
# File 'lib/polars/lazy_frame.rb', line 4983 def update( other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left" ) Utils.require_same_type(self, other) if ["outer", "outer_coalesce"].include?(how) how = "full" end if !["left", "inner", "full"].include?(how) msg = "`how` must be one of {'left', 'inner', 'full'}; found #{how.inspect}" raise ArgumentError, msg end slf = self row_index_used = false if on.nil? if left_on.nil? && right_on.nil? # no keys provided--use row index row_index_used = true row_index_name = "__POLARS_ROW_INDEX" slf = slf.with_row_index(name: row_index_name) other = other.with_row_index(name: row_index_name) left_on = right_on = [row_index_name] else # one of left or right is missing, raise error if left_on.nil? msg = "missing join columns for left frame" raise ArgumentError, msg end if right_on.nil? msg = "missing join columns for right frame" raise ArgumentError, msg end end else # move on into left/right_on to simplify logic left_on = right_on = on end if left_on.is_a?(::String) left_on = [left_on] end if right_on.is_a?(::String) right_on = [right_on] end left_schema = slf.collect_schema left_on.each do |name| if !left_schema.include?(name) msg = "left join column #{name.inspect} not found" raise ArgumentError, msg end end right_schema = other.collect_schema right_on.each do |name| if !right_schema.include?(name) msg = "right join column #{name.inspect} not found" raise ArgumentError, msg end end # no need to join if *only* join columns are in other (inner/left update only) if how != "full" && right_schema.length == right_on.length if row_index_used return slf.drop(row_index_name) end return slf end # only use non-idx right columns present in left frame right_other = Set.new(right_schema.to_h.keys).intersection(left_schema.to_h.keys) - Set.new(right_on) # When include_nulls is true, we need to distinguish records after the join that # were originally null in the right frame, as opposed to records that were null # because the key was missing from the right frame. # Add a validity column to track whether row was matched or not. if include_nulls validity = ["__POLARS_VALIDITY"] other = other.with_columns(F.lit(true).alias(validity[0])) else validity = [] end tmp_name = "__POLARS_RIGHT" drop_columns = right_other.map { |name| "#{name}#{tmp_name}" } + validity result = ( slf.join( other.select(*right_on, *right_other, *validity), left_on: left_on, right_on: right_on, how: how, suffix: tmp_name, coalesce: true, maintain_order: maintain_order ) .with_columns( right_other.map do |name| ( if include_nulls # use left value only when right value failed to join F.when(F.col(validity).is_null) .then(F.col(name)) .otherwise(F.col("#{name}#{tmp_name}")) else F.coalesce(["#{name}#{tmp_name}", F.col(name)]) end ).alias(name) end ) .drop(drop_columns) ) if row_index_used result = result.drop(row_index_name) end _from_rbldf(result._ldf) end |
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the LazyFrame to their variance value.
# File 'lib/polars/lazy_frame.rb', line 3997

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end
#width ⇒ Integer
Get the width of the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 160

def width
  _ldf.collect_schema.length
end
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 3274

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"
  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)
  _from_rbldf(_ldf.with_columns(rbexprs))
end
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/lazy_frame.rb', line 3298

def with_columns_seq(
  *exprs,
  **named_exprs
)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0
  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs,
    **named_exprs,
    __structify: structify
  )
  _from_rbldf(_ldf.with_columns_seq(rbexprs))
end
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
Add a column at index 0 that counts the rows.
Note that this can have a negative effect on query performance; it may, for instance, block predicate pushdown optimization.
# File 'lib/polars/lazy_frame.rb', line 3756

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end