Class: Polars::LazyFrame
- Inherits: Object
- Defined in: lib/polars/lazy_frame.rb
Overview
Representation of a Lazy computation graph/query against a DataFrame.
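Example (an illustrative sketch with made-up data and column names): each call below only extends the logical plan, and nothing executes until #collect is called.

require "polars-df"

lf = Polars::LazyFrame.new({"group" => ["a", "a", "b"], "value" => [1, 2, 3]})

# Build the query lazily, then execute it in one optimized pass.
df = lf
  .filter(Polars.col("value") > 1)
  .group_by("group")
  .agg(Polars.col("value").sum.alias("value_sum"))
  .collect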
Class Method Summary
-
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
Instance Method Summary
-
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#clear(n = 0) ⇒ LazyFrame
(also: #cleared)
Create an empty copy of the current LazyFrame.
-
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
-
#columns ⇒ Array
Get or set column names.
-
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
-
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
-
#drop(*columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
-
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
-
#fill_nan(fill_value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ DataFrame
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
-
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", join_nulls: false, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Perform an asof join.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
-
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#rename(mapping) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyFrame
(also: #group_by_rolling, #groupby_rolling)
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#set_sorted(column, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
-
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
-
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame
Persists a LazyFrame at the provided path.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
-
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
-
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
(also: #melt)
Unpivot a DataFrame from wide to long format.
-
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
-
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
Create a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 8

def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
      .lazy
      ._ldf
  )
end
Class Method Details
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 39

def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  Utils.wrap_ldf(RbLazyFrame.read_json(file))
end
Instance Method Details
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
# File 'lib/polars/lazy_frame.rb', line 847

def cache
  _from_rbldf(_ldf.cache)
end
#clear(n = 0) ⇒ LazyFrame Also known as: cleared
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
# File 'lib/polars/lazy_frame.rb', line 891

def clear(n = 0)
  DataFrame.new(columns: schema).clear(n).lazy
end
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
Note: use #fetch if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.
# File 'lib/polars/lazy_frame.rb', line 333

def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
    comm_subexpr_elim = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end
#columns ⇒ Array
Get or set column names.
# File 'lib/polars/lazy_frame.rb', line 65

def columns
  _ldf.collect_schema.keys
end
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
# File 'lib/polars/lazy_frame.rb', line 199

def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )

  ldf.describe_optimized_plan
end
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
# File 'lib/polars/lazy_frame.rb', line 192

def describe_plan
  _ldf.describe_plan
end
#drop(*columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1882

def drop(*columns)
  drop_cols = Utils._expand_selectors(self, *columns)
  _from_rbldf(_ldf.drop(drop_cols))
end
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 2467

def drop_nulls(subset: nil)
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.drop_nulls(subset))
end
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 83

def dtypes
  _ldf.collect_schema.values
end
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
# File 'lib/polars/lazy_frame.rb', line 2415

def explode(columns)
  columns = Utils.parse_into_list_of_expressions(columns)
  _from_rbldf(_ldf.explode(columns))
end
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and fewer rows being available in the scanned file all influence the final number of rows.
# File 'lib/polars/lazy_frame.rb', line 790

def fetch(
  n_rows = 500,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )
  Utils.wrap_df(ldf.fetch(n_rows))
end
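Example (illustrative sketch with made-up data), using #fetch to sanity-check a query before running #collect on the full input:

require "polars-df"

lf = Polars::LazyFrame.new({"a" => (1..10_000).to_a})
query = lf.filter(Polars.col("a") > 500)

preview = query.fetch(100)  # every scan reads at most 100 rows
full = query.collect        # the complete result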
#fill_nan(fill_value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use #fill_null instead.
Fill floating point NaN values.
# File 'lib/polars/lazy_frame.rb', line 2190

def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = F.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end
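Example (illustrative sketch) contrasting #fill_nan with #fill_null on a column containing both kinds of value:

require "polars-df"

lf = Polars::LazyFrame.new({"a" => [1.0, Float::NAN, nil]})

lf.fill_nan(0.0).collect   # replaces the NaN; the null stays null
lf.fill_null(0.0).collect  # replaces the null; the NaN stays NaN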
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/lazy_frame.rb', line 2155

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/lazy_frame.rb', line 934

def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.parse_into_expression(predicate, str_as_lit: false)
    )
  )
end
#first ⇒ LazyFrame
Get the first row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2090

def first
  slice(0, 1)
end
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/lazy_frame.rb', line 1071

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end
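Example (illustrative sketch with made-up data); the LazyGroupBy returned here is consumed with an agg call:

require "polars-df"

lf = Polars::LazyFrame.new({"grp" => ["a", "a", "b"], "val" => [1, 2, 3]})

lf.group_by("grp", maintain_order: true)
  .agg([
    Polars.col("val").sum.alias("total"),
    Polars.col("val").mean.alias("avg")
  ])
  .collect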
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ DataFrame Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32 or :i64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period, and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1418

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window"
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = by.nil? ? [] : Utils.parse_into_list_of_expressions(by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end
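Example (illustrative sketch with made-up hourly data); the index column is sorted first, then windows that start every hour and span two hours are aggregated:

require "polars-df"

lf = Polars::LazyFrame.new(
  {
    "time" => [Time.utc(2021, 1, 1, 0), Time.utc(2021, 1, 1, 1), Time.utc(2021, 1, 1, 2)],
    "value" => [1, 2, 3]
  }
)

# A window starts every hour and covers two hours, so a row can fall into
# more than one window.
lf.sort("time")
  .group_by_dynamic("time", every: "1h", period: "2h")
  .agg(Polars.col("value").sum.alias("value_sum"))
  .collect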
#head(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2066

def head(n = 5)
  slice(0, n)
end
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
# File 'lib/polars/lazy_frame.rb', line 120

def include?(key)
  columns.include?(key)
end
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/lazy_frame.rb', line 2570

def interpolate
  select(F.col("*").interpolate)
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", join_nulls: false, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Add a join operation to the Logical Plan.
# File 'lib/polars/lazy_frame.rb', line 1702

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  join_nulls: false,
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf, [], [], allow_parallel, join_nulls, force_parallel, how, suffix
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      join_nulls,
      how,
      suffix,
    )
  )
end
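Example (illustrative sketch with made-up frames), showing the `on` shorthand and the explicit `left_on`/`right_on` form:

require "polars-df"

left = Polars::LazyFrame.new({"id" => [1, 2, 3], "name" => ["a", "b", "c"]})
right = Polars::LazyFrame.new({"id" => [2, 3, 4], "score" => [20, 30, 40]})

# Same key name on both sides.
left.join(right, on: "id", how: "left").collect

# Keys given separately for each side.
left.join(right, left_on: "id", right_on: "id", how: "inner").collect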
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Perform an asof join.
This is similar to a left join, except that we match on the nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/lazy_frame.rb', line 1525

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str
    )
  )
end
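Example (illustrative sketch; the frames and the `time` key are made up). Both frames are marked as sorted by the join key before the asof join:

require "polars-df"

trades = Polars::LazyFrame.new({"time" => [1, 5, 10], "qty" => [100, 200, 300]}).set_sorted("time")
quotes = Polars::LazyFrame.new({"time" => [2, 4, 9], "price" => [1.0, 1.1, 1.2]}).set_sorted("time")

# For each trade, take the last quote whose time is <= the trade time.
trades.join_asof(quotes, on: "time", strategy: "backward").collect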
#last ⇒ LazyFrame
Get the last row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2083

def last
  tail(1)
end
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 840

def lazy
  self
end
#limit(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2051

def limit(n = 5)
  head(n)
end
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
# File 'lib/polars/lazy_frame.rb', line 2277

def max
  _from_rbldf(_ldf.max)
end
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
# File 'lib/polars/lazy_frame.rb', line 2337

def mean
  _from_rbldf(_ldf.mean)
end
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
# File 'lib/polars/lazy_frame.rb', line 2357

def median
  _from_rbldf(_ldf.median)
end
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
# File 'lib/polars/lazy_frame.rb', line 2670

def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
# File 'lib/polars/lazy_frame.rb', line 2297

def min
  _from_rbldf(_ldf.min)
end
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/lazy_frame.rb', line 185

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
# File 'lib/polars/lazy_frame.rb', line 2382

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end
#rename(mapping) ⇒ LazyFrame
Rename column names.
# File 'lib/polars/lazy_frame.rb', line 1893

def rename(mapping)
  existing = mapping.keys
  _new = mapping.values
  _from_rbldf(_ldf.rename(existing, _new))
end
#reverse ⇒ LazyFrame
Reverse the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1902

def reverse
  _from_rbldf(_ldf.reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyFrame Also known as: group_by_rolling, groupby_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Different from a dynamic group by, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use #group_by_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1163

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  index_column = Utils.parse_into_expression(index_column)
  if offset.nil?
    offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period))
  end

  rbexprs_by = (
    !by.nil? ? Utils.parse_into_list_of_expressions(by) : []
  )
  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)

  lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by)
  LazyGroupBy.new(lgb)
end
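Example (illustrative sketch with made-up timestamps); the index column is marked as sorted, and each group spans the two days ending at the row's own timestamp:

require "polars-df"

lf = Polars::LazyFrame.new(
  {
    "ts" => [Time.utc(2021, 1, 1), Time.utc(2021, 1, 2), Time.utc(2021, 1, 4)],
    "value" => [1, 2, 3]
  }
).set_sorted("ts")

# Windows are determined by the individual "ts" values, not fixed intervals.
lf.rolling(index_column: "ts", period: "2d")
  .agg(Polars.col("value").sum.alias("rolling_sum"))
  .collect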
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/lazy_frame.rb', line 101

def schema
  _ldf.collect_schema
end
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1030

def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end
#set_sorted(column, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
# File 'lib/polars/lazy_frame.rb', line 2682

def set_sorted(
  column,
  descending: false
)
  if !Utils.strlike?(column)
    msg = "expected a 'str' for argument 'column' in 'set_sorted'"
    raise TypeError, msg
  end
  with_columns(F.col(column).set_sorted(descending: descending))
end
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
# File 'lib/polars/lazy_frame.rb', line 1948

def shift(n, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_into_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/lazy_frame.rb', line 1998

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 606

def sink_csv(
  path,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)

  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_csv(
    path,
    include_bom,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_scientific,
    float_precision,
    null_value,
    quote_style,
    maintain_order
  )
end
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 504

def sink_ipc(
  path,
  compression: "zstd",
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_ipc(
    path,
    compression,
    maintain_order
  )
end
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 687

def sink_ndjson(
  path,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_json(path, maintain_order)
end
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame
Persists a LazyFrame at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 421

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_pagesize_limit: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  no_optimization: false,
  slice_pushdown: true
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  lf.sink_parquet(
    path,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_pagesize_limit,
    maintain_order
  )
end
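Example (illustrative sketch; the input path, output path, and column name are hypothetical). Both the scan and the sink stream, so the full result never has to fit in memory at once:

require "polars-df"

# sink_parquet runs the streaming engine and writes row groups to disk
# as they are produced.
Polars.scan_csv("events.csv")
  .filter(Polars.col("status") == "ok")
  .sink_parquet("events_ok.parquet", compression: "zstd")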
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2031

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
# File 'lib/polars/lazy_frame.rb', line 264

def sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true)
  if by.is_a?(::String) && more_by.empty?
    return _from_rbldf(
      _ldf.sort(
        by, reverse, nulls_last, maintain_order, multithreaded
      )
    )
  end

  by = Utils.parse_into_list_of_expressions(by, *more_by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by")
  _from_rbldf(
    _ldf.sort_by_exprs(
      by, reverse, nulls_last, maintain_order, multithreaded
    )
  )
end
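Example (illustrative sketch with made-up data) covering the three forms listed above:

require "polars-df"

lf = Polars::LazyFrame.new({"a" => [3, 1, 2], "b" => ["z", "x", "y"]})

lf.sort("a").collect                          # a single column name
lf.sort(Polars.col("a") * -1).collect         # an expression
lf.sort("a", "b", reverse: true).collect      # multiple keys, both descending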
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
# File 'lib/polars/lazy_frame.rb', line 2225

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
# File 'lib/polars/lazy_frame.rb', line 2317

def sum
  _from_rbldf(_ldf.sum)
end
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
# File 'lib/polars/lazy_frame.rb', line 2076

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 2148

def take_every(n)
  select(F.col("*").take_every(n))
end
#to_s ⇒ String
Returns a string representing the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 132

def to_s
  <<~EOS
    naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan)

    #{describe_plan}
  EOS
end
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or subset.
# File 'lib/polars/lazy_frame.rb', line 2434

def unique(maintain_order: true, subset: nil, keep: "first")
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.unique(maintain_order, subset, keep))
end
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct type.
# File 'lib/polars/lazy_frame.rb', line 2625

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbldf(_ldf.unnest(names))
end
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame Also known as: melt
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/lazy_frame.rb', line 2522

def unpivot(
  on,
  index: nil,
  variable_name: nil,
  value_name: nil,
  streamable: true
)
  if !streamable
    warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated"
  end

  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbldf(
    _ldf.unpivot(on, index, value_name, variable_name)
  )
end
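Example (illustrative sketch with made-up data), turning the measured columns into variable/value rows while keeping the identifier:

require "polars-df"

wide = Polars::LazyFrame.new({"id" => [1, 2], "x" => [10, 20], "y" => [30, 40]})

# Produces one row per (id, metric) pair with columns: id, metric, reading.
wide.unpivot(["x", "y"], index: "id", variable_name: "metric", value_name: "reading").collect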
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
# File 'lib/polars/lazy_frame.rb', line 2257

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end
#width ⇒ Integer
Get the width of the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 113

def width
  _ldf.collect_schema.length
end
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1871

def with_column(column)
  with_columns([column])
end
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1786

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
# File 'lib/polars/lazy_frame.rb', line 1823

def with_context(other)
  if !other.is_a?(::Array)
    other = [other]
  end

  _from_rbldf(_ldf.with_context(other.map(&:_ldf)))
end
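Example (illustrative sketch with made-up frames); `b` is not a column of `lf` but becomes visible to the expression through the context:

require "polars-df"

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]})
other = Polars::LazyFrame.new({"b" => [10]})

# Take the first value of the context column and add it to every row of "a".
lf.with_context(other).select(Polars.col("a") + Polars.col("b").first).collect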
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame Also known as: with_row_count
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Add a column at index 0 that counts the rows.
# File 'lib/polars/lazy_frame.rb', line 2126

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
# File 'lib/polars/lazy_frame.rb', line 146

def write_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  _ldf.write_json(file)
  nil
end