Class: Polars::GroupBy
- Inherits: Object
- Defined in: lib/polars/group_by.rb
Overview
Starts a new GroupBy operation.
Instance Method Summary
- #agg(*aggs, **named_aggs) ⇒ DataFrame
  Compute aggregations for each group of a group by operation.
- #all ⇒ DataFrame
  Aggregate the groups into Series.
- #count ⇒ DataFrame
  Count the number of values in each group.
- #each ⇒ Object
  Allows iteration over the groups of the group by operation.
- #first(ignore_nulls: false) ⇒ DataFrame
  Aggregate the first values in the group.
- #having(*predicates) ⇒ GroupBy
  Filter groups with a list of predicates after aggregation.
- #head(n = 5) ⇒ DataFrame
  Get the first n rows of each group.
- #last(ignore_nulls: false) ⇒ DataFrame
  Aggregate the last values in the group.
- #len(name: nil) ⇒ DataFrame
  Return the number of rows in each group.
- #map_groups(&function) ⇒ DataFrame
  Apply a custom/user-defined function (UDF) over the groups as a sub-DataFrame.
- #max ⇒ DataFrame
  Reduce the groups to the maximal value.
- #mean ⇒ DataFrame
  Reduce the groups to the mean values.
- #median ⇒ DataFrame
  Return the median per group.
- #min ⇒ DataFrame
  Reduce the groups to the minimal value.
- #n_unique ⇒ DataFrame
  Count the unique values per group.
- #quantile(quantile, interpolation: "nearest") ⇒ DataFrame
  Compute the quantile per group.
- #sum ⇒ DataFrame
  Reduce the groups to the sum.
- #tail(n = 5) ⇒ DataFrame
  Get the last n rows of each group.
Instance Method Details
#agg(*aggs, **named_aggs) ⇒ DataFrame
Compute aggregations for each group of a group by operation.
# File 'lib/polars/group_by.rb', line 205
def agg(*aggs, **named_aggs)
  _lgb
    .agg(*aggs, **named_aggs)
    .collect(optimizations: QueryOptFlags.none)
end
#all ⇒ DataFrame
Aggregate the groups into Series.
# File 'lib/polars/group_by.rb', line 386
def all
  agg(F.all)
end
#count ⇒ DataFrame
Count the number of values in each group.
# File 'lib/polars/group_by.rb', line 611
def count
  agg(Polars.len.alias("count"))
end
#each ⇒ Object
Allows iteration over the groups of the group by operation.
# File 'lib/polars/group_by.rb', line 37
def each
  return to_enum(:each) unless block_given?

  temp_col = "__POLARS_GB_GROUP_INDICES"
  groups_df =
    @df.lazy
      .with_row_index(name: temp_col)
      .group_by(@by, **@named_by, maintain_order: @maintain_order)
      .agg(Polars.col(temp_col))
      .collect(optimizations: QueryOptFlags.none)

  group_names = groups_df.select(Polars.all.exclude(temp_col))

  # When grouping by a single column, group name is a single value
  # When grouping by multiple columns, group name is a tuple of values
  if @by.is_a?(::String) || @by.is_a?(Expr)
    _group_names = group_names.to_series.each
  else
    _group_names = group_names.iter_rows
  end

  _group_indices = groups_df.select(temp_col).to_series
  _current_index = 0

  while _current_index < _group_indices.length
    group_name = _group_names.next
    group_data = @df[_group_indices[_current_index]]
    _current_index += 1

    yield group_name, group_data
  end
end
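The listing above tags every row with an index, groups on the key, collects the indices per group, and then slices the original frame with them. A minimal plain-Ruby sketch of that index-based technique follows. It uses an array of hashes as a stand-in for a DataFrame; the `rows` data and the `each_group` helper are hypothetical and are not part of Polars.

```ruby
# Plain-Ruby model of index-based group iteration.
# `rows` stands in for a DataFrame; `each_group` is a hypothetical helper.
def each_group(rows, key)
  return to_enum(:each_group, rows, key) unless block_given?

  # Step 1: record the position of every row under its group key.
  # Hash insertion order keeps groups in first-appearance order.
  indices_by_group = Hash.new { |hash, name| hash[name] = [] }
  rows.each_with_index { |row, i| indices_by_group[row[key]] << i }

  # Step 2: slice the original rows with each group's indices and yield
  # the group name together with the group's data.
  indices_by_group.each do |group_name, indices|
    yield group_name, rows.values_at(*indices)
  end
end

rows = [
  { "a" => "x", "b" => 1 },
  { "a" => "y", "b" => 2 },
  { "a" => "x", "b" => 3 }
]

groups = each_group(rows, "a").to_a
```

This model always yields groups in first-appearance order. In the real method the order presumably depends on `@maintain_order`, which the listing passes through to the underlying `group_by`.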
#first(ignore_nulls: false) ⇒ DataFrame
Aggregate the first values in the group.
# File 'lib/polars/group_by.rb', line 461
def first(ignore_nulls: false)
  agg(F.all.first(ignore_nulls: ignore_nulls))
end
#having(*predicates) ⇒ GroupBy
Filter groups with a list of predicates after aggregation.
Using this method is equivalent to adding the predicates to the aggregation and filtering afterwards.
This method can be chained and all conditions will be combined using &.
# File 'lib/polars/group_by.rb', line 101
def having(*predicates)
  GroupBy.new(
    @df,
    *@by,
    maintain_order: @maintain_order,
    predicates: Utils._chain_predicates(@predicates, predicates),
    **@named_by
  )
end
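The equivalence described for `having` (aggregate first, then keep only the groups that satisfy every predicate) can be modelled in plain Ruby. This is a sketch of the semantics only, not the Polars implementation; the `rows` data, the `aggregate` lambda, and the predicate lambdas are hypothetical stand-ins for a DataFrame and for Polars expressions.

```ruby
# Plain-Ruby model of "aggregate, then filter groups".
# `rows`, `aggregate`, and the predicate lambdas are hypothetical stand-ins.
rows = [
  { "a" => "x", "b" => 1 },
  { "a" => "x", "b" => 5 },
  { "a" => "y", "b" => 2 },
  { "a" => "z", "b" => 9 }
]

# Reduce each group to one summary row (sum and row count of "b").
aggregate = lambda do |data|
  data.group_by { |row| row["a"] }.map do |name, group|
    { "a" => name, "b_sum" => group.sum { |r| r["b"] }, "len" => group.size }
  end
end

# Two predicates on aggregated values; chaining combines them with AND.
predicates = [
  ->(g) { g["b_sum"] > 3 },
  ->(g) { g["len"] < 2 }
]

kept = aggregate.call(rows).select { |g| predicates.all? { |p| p.call(g) } }
result = kept.map { |g| g["a"] }
```

Group `x` passes the first predicate but fails the second, and group `y` fails the first, so only `z` survives. This mirrors how chained conditions are combined with `&`.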
#head(n = 5) ⇒ DataFrame
Get the first n rows of each group.
# File 'lib/polars/group_by.rb', line 317
def head(n = 5)
  _lgb.head(n).collect(optimizations: QueryOptFlags._eager)
end
#last(ignore_nulls: false) ⇒ DataFrame
Aggregate the last values in the group.
# File 'lib/polars/group_by.rb', line 495
def last(ignore_nulls: false)
  agg(F.all.last(ignore_nulls: ignore_nulls))
end
#len(name: nil) ⇒ DataFrame
Return the number of rows in each group.
# File 'lib/polars/group_by.rb', line 423
def len(name: nil)
  len_expr = F.len
  if !name.nil?
    len_expr = len_expr.alias(name)
  end
  agg(len_expr)
end
#map_groups(&function) ⇒ DataFrame
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
Apply a custom/user-defined function (UDF) over the groups as a sub-DataFrame.
Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
- The native expression engine runs in Rust; UDFs run in Ruby.
- Use of Ruby UDFs forces the DataFrame to be materialized in memory.
- Polars-native expressions can be parallelised (UDFs cannot).
- Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
# File 'lib/polars/group_by.rb', line 252
def map_groups(&function)
  if @predicates&.any?
    msg = "cannot call `map_groups` when filtering groups with `having`"
    raise TypeError, msg
  end
  if @named_by&.any?
    msg = "cannot call `map_groups` when grouping by named expressions"
    raise TypeError, msg
  end
  if !@by.all? { |c| Utils.strlike?(c) }
    msg = "cannot call `map_groups` when grouping by an expression"
    raise TypeError, msg
  end

  by_strs = @by.map(&:to_s)

  @df.class._from_rbdf(
    @df._df.group_by_map_groups(by_strs, function, @maintain_order)
  )
end
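Conceptually, `map_groups` splits the frame into one sub-frame per group, calls the function on each, and combines the results. The plain-Ruby sketch below models that split-apply-combine shape with an array of hashes. `map_groups_model` and the `rows` data are hypothetical and only illustrate the idea; the real method delegates to native code and, as the listing shows, raises `TypeError` when combined with `having`, named expressions, or expression keys.

```ruby
# Plain-Ruby model of split-apply-combine over groups.
# `map_groups_model` is a hypothetical helper, not a Polars method.
def map_groups_model(rows, key)
  # Split rows by key, hand each sub-table to the block, concatenate results.
  rows.group_by { |row| row[key] }.flat_map { |_name, group| yield group }
end

rows = [
  { "a" => "x", "b" => 1 },
  { "a" => "x", "b" => 5 },
  { "a" => "y", "b" => 2 }
]

# User-defined function: keep the row with the largest "b" in each group.
top_rows = map_groups_model(rows, "a") do |group|
  [group.max_by { |row| row["b"] }]
end
```

Every block call here runs in Ruby on a materialized sub-table, which is the cost the performance notes above warn about.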
#max ⇒ DataFrame
Reduce the groups to the maximal value.
# File 'lib/polars/group_by.rb', line 582
def max
  agg(Polars.all.max)
end
#mean ⇒ DataFrame
Reduce the groups to the mean values.
# File 'lib/polars/group_by.rb', line 640
def mean
  agg(Polars.all.mean)
end
#median ⇒ DataFrame
Return the median per group.
# File 'lib/polars/group_by.rb', line 727
def median
  agg(Polars.all.median)
end
#min ⇒ DataFrame
Reduce the groups to the minimal value.
# File 'lib/polars/group_by.rb', line 553
def min
  agg(Polars.all.min)
end
#n_unique ⇒ DataFrame
Count the unique values per group.
# File 'lib/polars/group_by.rb', line 667
def n_unique
  agg(Polars.all.n_unique)
end
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Compute the quantile per group.
# File 'lib/polars/group_by.rb', line 700
def quantile(quantile, interpolation: "nearest")
  agg(Polars.all.quantile(quantile, interpolation: interpolation))
end
#sum ⇒ DataFrame
Reduce the groups to the sum.
# File 'lib/polars/group_by.rb', line 524
def sum
  agg(Polars.all.sum)
end
#tail(n = 5) ⇒ DataFrame
Get the last n rows of each group.
# File 'lib/polars/group_by.rb', line 365
def tail(n = 5)
  _lgb.tail(n).collect(optimizations: QueryOptFlags._eager)
end
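To make the per-group meaning of `head` and `tail` concrete, here is a small plain-Ruby model that takes the first or last `n` rows of every group. `group_head`, `group_tail`, and the `rows` data are hypothetical; the row order of a real Polars result may differ from this model.

```ruby
# Plain-Ruby model of per-group head/tail.
# `group_head` and `group_tail` are hypothetical helpers, not Polars methods.
def group_head(rows, key, n = 5)
  rows.group_by { |row| row[key] }.flat_map { |_name, group| group.first(n) }
end

def group_tail(rows, key, n = 5)
  rows.group_by { |row| row[key] }.flat_map { |_name, group| group.last(n) }
end

rows = [
  { "a" => "x", "b" => 1 },
  { "a" => "x", "b" => 5 },
  { "a" => "y", "b" => 2 },
  { "a" => "x", "b" => 7 }
]

first_two = group_head(rows, "a", 2).map { |row| row["b"] }
last_two = group_tail(rows, "a", 2).map { |row| row["b"] }
```

Groups with fewer than `n` rows are returned whole, as group `y` is here.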