Class: Polars::LazyFrame
Inherits: Object
Defined in: lib/polars/lazy_frame.rb
Overview
Representation of a Lazy computation graph/query against a DataFrame.
Class Method Summary
-
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
Instance Method Summary
-
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ LazyFrame
(also: #cleared)
Create an empty copy of the current LazyFrame.
-
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
-
#columns ⇒ Array
Get column names.
-
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
-
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
-
#drop(*columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
-
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
-
#fill_nan(fill_value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ LazyGroupBy
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
-
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, allow_parallel: true, force_parallel: false, coalesce: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true) ⇒ LazyFrame
Perform an asof join.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
-
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyGroupBy
(also: #group_by_rolling, #groupby_rolling)
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#set_sorted(column, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
-
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
-
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true, storage_options: nil, retries: 2) ⇒ DataFrame
Persists a LazyFrame at the provided path.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
-
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
-
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
(also: #melt)
Unpivot a DataFrame from wide to long format.
-
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
-
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
Create a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 8

def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
    .lazy
    ._ldf
  )
end
Class Method Details
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 39

def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  Utils.wrap_ldf(RbLazyFrame.read_json(file))
end
Instance Method Details
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
# File 'lib/polars/lazy_frame.rb', line 877

def cache
  _from_rbldf(_ldf.cache)
end
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
# File 'lib/polars/lazy_frame.rb', line 930

def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end
#clear(n = 0) ⇒ LazyFrame Also known as: cleared
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
# File 'lib/polars/lazy_frame.rb', line 982

def clear(n = 0)
  DataFrame.new(columns: schema).clear(n).lazy
end
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
Note: use #fetch if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.
# File 'lib/polars/lazy_frame.rb', line 333

def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
    comm_subexpr_elim = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end
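A minimal usage sketch (the column names and values here are made up for illustration): build a lazy query from an eager frame, add a filter and a projection, then collect:

require "polars-df"

lf = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]}).lazy
# Nothing is executed until collect is called
result = lf.filter(Polars.col("a") > 1).select("b").collect
# result is an eager Polars::DataFrame containing the rows where "a" > 1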
#columns ⇒ Array
Get column names.
# File 'lib/polars/lazy_frame.rb', line 65

def columns
  _ldf.collect_schema.keys
end
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
# File 'lib/polars/lazy_frame.rb', line 199

def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )

  ldf.describe_optimized_plan
end
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
# File 'lib/polars/lazy_frame.rb', line 192

def describe_plan
  _ldf.describe_plan
end
#drop(*columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2229

def drop(*columns)
  drop_cols = Utils.(self, *columns)
  _from_rbldf(_ldf.drop(drop_cols))
end
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 3131

def drop_nulls(subset: nil)
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.drop_nulls(subset))
end
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 83

def dtypes
  _ldf.collect_schema.values
end
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
# File 'lib/polars/lazy_frame.rb', line 3032

def explode(columns)
  columns = Utils.parse_into_list_of_expressions(columns)
  _from_rbldf(_ldf.explode(columns))
end
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and a lower number of rows available in the scanned file all influence the final number of rows.
# File 'lib/polars/lazy_frame.rb', line 820

def fetch(
  n_rows = 500,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )
  Utils.wrap_df(ldf.fetch(n_rows))
end
#fill_nan(fill_value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use #fill_null instead.
Fill floating point NaN values.
# File 'lib/polars/lazy_frame.rb', line 2807

def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = F.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end
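A small sketch of the distinction (values are made up): fill_nan only touches floating point NaN, while fill_null only touches missing values:

require "polars-df"

lf = Polars::DataFrame.new({"x" => [1.0, Float::NAN, nil]}).lazy
lf.fill_nan(0.0).collect   # replaces the NaN; the null stays null
lf.fill_null(0.0).collect  # replaces the null; the NaN stays NaN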
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/lazy_frame.rb', line 2772

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/lazy_frame.rb', line 1025

def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.parse_into_expression(predicate, str_as_lit: false)
    )
  )
end
#first ⇒ LazyFrame
Get the first row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2641

def first
  slice(0, 1)
end
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/lazy_frame.rb', line 1162

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end
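A usage sketch with made-up data, assuming the LazyGroupBy#agg API (not shown on this page) for the aggregation step:

require "polars-df"

lf = Polars::DataFrame.new({"grp" => ["a", "a", "b"], "val" => [1, 2, 3]}).lazy
lf.group_by("grp", maintain_order: true)
  .agg(Polars.col("val").sum.alias("val_sum"))
  .collect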
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ LazyGroupBy Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32, :i64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1509

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window"
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = by.nil? ? [] : Utils.parse_into_list_of_expressions(by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end
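A usage sketch with made-up daily data, assuming LazyGroupBy#agg and Date values in the index column (which must be sorted):

require "date"
require "polars-df"

lf = Polars::DataFrame.new(
  {
    "ts" => [Date.new(2021, 1, 1), Date.new(2021, 1, 2), Date.new(2021, 1, 3)],
    "n" => [1, 2, 3]
  }
).lazy.set_sorted("ts")

# 2-day windows that start every day, so a row can land in two windows
lf.group_by_dynamic("ts", every: "1d", period: "2d")
  .agg(Polars.col("n").sum.alias("n_sum"))
  .collect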
#head(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2546

def head(n = 5)
  slice(0, n)
end
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
# File 'lib/polars/lazy_frame.rb', line 120

def include?(key)
  columns.include?(key)
end
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/lazy_frame.rb', line 3234

def interpolate
  select(F.col("*").interpolate)
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, allow_parallel: true, force_parallel: false, coalesce: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
# File 'lib/polars/lazy_frame.rb', line 1996

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf, [], [], allow_parallel, join_nulls, force_parallel, how, suffix, validate, coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      join_nulls,
      how,
      suffix,
      validate,
      coalesce
    )
  )
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true) ⇒ LazyFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/lazy_frame.rb', line 1805

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce
    )
  )
end
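A usage sketch with made-up data; both frames are already sorted on the "time" key:

require "polars-df"

trades = Polars::DataFrame.new({"time" => [1, 5, 10], "trade" => ["t1", "t2", "t3"]}).lazy
quotes = Polars::DataFrame.new({"time" => [2, 4, 9], "quote" => [1.0, 1.1, 1.2]}).lazy

# For each trade, take the last quote whose "time" is at or before the trade's "time"
trades.join_asof(quotes, on: "time", strategy: "backward").collect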
#last ⇒ LazyFrame
Get the last row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2616

def last
  tail(1)
end
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 870

def lazy
  self
end
#limit(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2496

def limit(n = 5)
  head(n)
end
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
# File 'lib/polars/lazy_frame.rb', line 2894

def max
  _from_rbldf(_ldf.max)
end
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
# File 'lib/polars/lazy_frame.rb', line 2954

def mean
  _from_rbldf(_ldf.mean)
end
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
# File 'lib/polars/lazy_frame.rb', line 2974

def median
  _from_rbldf(_ldf.median)
end
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure the frames are sorted by that key, otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
# File 'lib/polars/lazy_frame.rb', line 3334

def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end
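A usage sketch; both frames share the same schema and are already sorted by "id":

require "polars-df"

a = Polars::DataFrame.new({"id" => [1, 3, 5], "v" => ["a", "c", "e"]}).lazy
b = Polars::DataFrame.new({"id" => [2, 4, 6], "v" => ["b", "d", "f"]}).lazy

a.merge_sorted(b, "id").collect  # rows interleaved, still sorted by "id"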
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
# File 'lib/polars/lazy_frame.rb', line 2914

def min
  _from_rbldf(_ldf.min)
end
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/lazy_frame.rb', line 185

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
# File 'lib/polars/lazy_frame.rb', line 2999

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
# File 'lib/polars/lazy_frame.rb', line 2279

def rename(mapping, strict: true)
  if mapping.respond_to?(:call)
    select(F.all.name.map(&mapping))
  else
    existing = mapping.keys
    _new = mapping.values
    _from_rbldf(_ldf.rename(existing, _new, strict))
  end
end
#reverse ⇒ LazyFrame
Reverse the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2312

def reverse
  _from_rbldf(_ldf.reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyGroupBy Also known as: group_by_rolling, groupby_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Different from a dynamic group by, the windows are determined by the individual values and are not of constant intervals. For constant intervals use #group_by_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1254

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  index_column = Utils.parse_into_expression(index_column)
  if offset.nil?
    offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period))
  end

  rbexprs_by = (
    !by.nil? ? Utils.parse_into_list_of_expressions(by) : []
  )
  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)

  lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by)
  LazyGroupBy.new(lgb)
end
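A usage sketch on an integer index column, assuming LazyGroupBy#agg; the index is already sorted, and "3i" means a window of 3 index units:

require "polars-df"

lf = Polars::DataFrame.new({"idx" => [1, 2, 3, 5, 8], "x" => [1, 2, 3, 4, 5]})
  .lazy
  .set_sorted("idx")

# For each row, sum "x" over the window (idx - 3, idx]
lf.rolling(index_column: "idx", period: "3i")
  .agg(Polars.col("x").sum.alias("x_sum"))
  .collect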
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/lazy_frame.rb', line 101

def schema
  _ldf.collect_schema
end
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1121

def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end
#set_sorted(column, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
# File 'lib/polars/lazy_frame.rb', line 3346

def set_sorted(
  column,
  descending: false
)
  if !Utils.strlike?(column)
    msg = "expected a 'str' for argument 'column' in 'set_sorted'"
    raise TypeError, msg
  end
  with_columns(F.col(column).set_sorted(descending: descending))
end
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
# File 'lib/polars/lazy_frame.rb', line 2358

def shift(n, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_into_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/lazy_frame.rb', line 2408

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 628

def sink_csv(
  path,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)

  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_csv(
    path,
    include_bom,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_scientific,
    float_precision,
    null_value,
    quote_style,
    maintain_order
  )
end
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 514

def sink_ipc(
  path,
  compression: "zstd",
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  # TODO support storage options in Rust
  storage_options = nil
  retries = 2

  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  lf.sink_ipc(
    path,
    compression,
    maintain_order,
    storage_options,
    retries
  )
end
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 709

def sink_ndjson(
  path,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false,
  storage_options: nil,
  retries: 2
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  lf.sink_json(path, maintain_order, storage_options, retries)
end
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true, storage_options: nil, retries: 2) ⇒ DataFrame
Persists a LazyFrame at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 421

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_pagesize_limit: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  no_optimization: false,
  slice_pushdown: true,
  storage_options: nil,
  retries: 2
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  lf.sink_parquet(
    path,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_pagesize_limit,
    maintain_order,
    storage_options,
    retries
  )
end
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2441

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
# File 'lib/polars/lazy_frame.rb', line 264

def sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true)
  if by.is_a?(::String) && more_by.empty?
    return _from_rbldf(
      _ldf.sort(
        by, reverse, nulls_last, maintain_order, multithreaded
      )
    )
  end

  by = Utils.parse_into_list_of_expressions(by, *more_by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by")
  _from_rbldf(
    _ldf.sort_by_exprs(
      by, reverse, nulls_last, maintain_order, multithreaded
    )
  )
end
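A few usage sketches with made-up data, covering a plain column name, a reversed sort, and multiple keys with an expression:

require "polars-df"

lf = Polars::DataFrame.new({"a" => [3, 1, 2], "b" => [nil, 5, 4]}).lazy

lf.sort("a").collect
lf.sort("a", reverse: true).collect
lf.sort(Polars.col("b"), "a", nulls_last: true).collect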
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
# File 'lib/polars/lazy_frame.rb', line 2842

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
# File 'lib/polars/lazy_frame.rb', line 2934

def sum
  _from_rbldf(_ldf.sum)
end
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
# File 'lib/polars/lazy_frame.rb', line 2591

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 2699

def take_every(n)
  select(F.col("*").take_every(n))
end
#to_s ⇒ String
Returns a string representing the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 132

def to_s
  <<~EOS
    naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan)
    #{describe_plan}
  EOS
end
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or subset.
# File 'lib/polars/lazy_frame.rb', line 3098

def unique(maintain_order: true, subset: nil, keep: "first")
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.unique(maintain_order, subset, keep))
end
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct type.
# File 'lib/polars/lazy_frame.rb', line 3289

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbldf(_ldf.unnest(names))
end
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame Also known as: melt
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/lazy_frame.rb', line 3186

def unpivot(
  on,
  index: nil,
  variable_name: nil,
  value_name: nil,
  streamable: true
)
  if !streamable
    warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated"
  end

  on = on.nil? ? [] : Utils.parse_into_list_of_expressions(on)
  index = index.nil? ? [] : Utils.parse_into_list_of_expressions(index)

  _from_rbldf(
    _ldf.unpivot(on, index, value_name, variable_name)
  )
end
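A usage sketch with made-up data: keep "id" as the index and unpivot the measurement columns into variable/value pairs:

require "polars-df"

lf = Polars::DataFrame.new({"id" => ["a", "b"], "x" => [1, 3], "y" => [2, 4]}).lazy

lf.unpivot(["x", "y"], index: "id").collect
# => columns: id, variable ("x" or "y"), value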
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
# File 'lib/polars/lazy_frame.rb', line 2874

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end
#width ⇒ Integer
Get the width of the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 113

def width
  _ldf.collect_schema.length
end
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2169

def with_column(column)
  with_columns([column])
end
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2084

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
# File 'lib/polars/lazy_frame.rb', line 2121

def with_context(other)
  if !other.is_a?(::Array)
    other = [other]
  end
  _from_rbldf(_ldf.with_context(other.map(&:_ldf)))
end
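A usage sketch with made-up data: expressions in the query can refer to columns of the context frame as well:

require "polars-df"

lf = Polars::DataFrame.new({"a" => [1, 2, 3]}).lazy
other = Polars::DataFrame.new({"b" => [10, 20, 30]}).lazy

lf.with_context(other)
  .select(Polars.col("a"), Polars.col("b").sum.alias("b_total"))
  .collect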
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame Also known as: with_row_count
Note: this can have a negative effect on query performance. It may, for instance, block predicate pushdown optimization.
Add a column at index 0 that counts the rows.
# File 'lib/polars/lazy_frame.rb', line 2677

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
# File 'lib/polars/lazy_frame.rb', line 146

def write_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  _ldf.write_json(file)
  nil
end
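A round-trip sketch with .read_json; "plan.json" is an arbitrary path used for illustration:

require "polars-df"

lf = Polars::DataFrame.new({"a" => [1, 2, 3]}).lazy.filter(Polars.col("a") > 1)
lf.write_json("plan.json")

restored = Polars::LazyFrame.read_json("plan.json")
puts restored.describe_plan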