`riptable.rt_groupby`

Classes

GroupBy


class riptable.rt_groupby.GroupBy(dataset, keys=None, filter=None, ordered=None, sort_display=None, return_all=False, hint_size=0, lex=None, rec=False, totals=False, copy=False, cutoffs=None, verbose=False, **kwargs)

Bases: riptable.rt_groupbyops.GroupByOps

Parameters:

dataset (Dataset) – The dataset object
keys (list) – List of column names to group by.
filter (array of bool, default None) – Boolean mask array applied as a filter before grouping.
return_all (bool, default False) – When True, return all the dataset columns for every operation.
hint_size (int, optional) – Hint size for the hash.
sort_display (bool, default True) – Indicates
lex (bool, default False) – When True, use a lexsort to find the groups (otherwise use a hash).
totals (bool) –
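
The lex parameter chooses between two ways of finding groups: a hash or a lexsort. The plain-Python sketch below models both strategies on a single key column. It illustrates the idea only and says nothing about riptable's internals; the function names group_by_hash and group_by_sort are hypothetical.

```python
def group_by_hash(keys):
    """Assign bin ids with a dict; bins are numbered in order of first appearance."""
    bin_of = {}
    ikey = []
    for k in keys:
        if k not in bin_of:
            bin_of[k] = len(bin_of)
        ikey.append(bin_of[k])
    return list(bin_of), ikey


def group_by_sort(keys):
    """Assign bin ids by sorting row indices on the key; bins come out in sorted key order."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    uniques = []
    ikey = [0] * len(keys)
    for i in order:
        if not uniques or uniques[-1] != keys[i]:
            uniques.append(keys[i])
        ikey[i] = len(uniques) - 1
    return uniques, ikey


keys = ["b", "a", "b", "c", "a"]
hash_uniques, hash_ikey = group_by_hash(keys)
sort_uniques, sort_ikey = group_by_sort(keys)
print(hash_uniques, hash_ikey)  # ['b', 'a', 'c'] [0, 1, 0, 2, 1]
print(sort_uniques, sort_ikey)  # ['a', 'b', 'c'] [1, 0, 1, 2, 0]
```

Both strategies put the same rows together; they differ only in how the bins are numbered and ordered.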

property gb_keychain

property gbkeys: dictionary of numpy arrays binned from

property ifirstkey

property ilastkey

property isortrows: sorted index or None

property transform

The transform property sets a flag so that the next reduce function called after transform repopulates the original array with the reduced value.

Example

>>> ds.groupby(['side', 'venue']).transform.sum()
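
To make the difference between a plain reduction and a transformed one concrete, here is a minimal plain-Python model. It assumes that "repopulating the original array" means every row receives the reduced value of its own group. The helpers grouped_sum and transform_sum are hypothetical and are not part of riptable.

```python
from collections import defaultdict


def grouped_sum(keys, values):
    """Ordinary reduction: one total per group."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


def transform_sum(keys, values):
    """Transform-style reduction: each row gets its group's total (assumed semantics)."""
    totals = grouped_sum(keys, values)
    return [totals[k] for k in keys]


keys = ["a", "a", "b", "c", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reduced = grouped_sum(keys, values)
broadcast = transform_sum(keys, values)
print(reduced)    # {'a': 8.0, 'b': 3.0, 'c': 4.0}
print(broadcast)  # [8.0, 8.0, 3.0, 4.0, 8.0]
```

The reduced result has one entry per group, while the transformed result has the same length as the input.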

DebugMode = False

TestCatGb = True

__getattr__(name)

__getattr__ is called when ‘.’ is used to select a single column.

Examples

>>> ds = Dataset({'col_'+str(i): np.random.rand(5) for i in range(5)})
>>> ds.keycol = FA(['a','a','b','c','a'])
>>> ds.gb('keycol').col_4.mean()
*keycol   col_4
-------   -----
a          0.73
b          0.03
c          0.76

__getitem__(fld)

__iter__(): Generates tuples of key, value pairs. Keys are key values for single key, or tuples of key values for multikey. Values are datasets containing all rows from data in group for that key.

__repr__(): Return repr(self).

__str__(): Return str(self).

_build_string()

_calculate_all(funcNum, *args, func_param=0, **kwargs): Generate a GroupByKeys object if necessary and ask for the result of a calculation from the grouping object. Returns: a grouped by dataset with the result from the calculation

_getitem(fld)

Called by __getitem__ and __getattr__. Uses the field to index into the stored dataset. Often used to limit the data the groupby operation is being performed on. Returns a shallow copy of the groupby object.

This routine gets hit during the following common code pattern:

>>> ds = Dataset({'col_'+str(i): np.random.rand(5) for i in range(5)})
>>> ds.keycol = FA(['a','a','b','c','a'])
>>> ds.gb('keycol')[['col_1', 'col_2']].sum()
*keycol   col_1   col_2
-------   -----   -----
a          1.92    0.89
b          0.70    0.46
c          0.07    0.42

>>> ds.gb('keycol').col_4.mean()
*keycol   col_4
-------   -----
a          0.73
b          0.03
c          0.76

_grouping_data_as_dict(ds)

_pop_gb_data(calledfrom, userfunc, *args, **kwargs): GroupBy holds on to its dataset. There may be no additional data provided.

add_totals(gb_ds)

as_categorical(): Returns a categorical using the same binning information as the GroupBy object (no addtl. hash required). New categorical will not share a grouping object with this groupby object, but will share a reference to the iKey. Categorical operation results will be sorted or unsorted depending on if ‘gb’ or ‘gbu’ called this.

backfill(limit=0, fill_val=None, inplace=False)

Backward fill the values

Parameters:

limit (int, optional) – Limit on how many values to fill.

See also

fill_forward, fill_backward, fill_invalid

copy(deep=True): Called from getitem when user follows gb with []

count(**kwargs): Compute the count of each group.

abstract expanding(**kwargs)

fill_backward(limit=0, fill_val=None, inplace=False)

Replace NaN and invalid array values by propagating the next encountered valid group value backward.

Parameters:

limit (int, default 0) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.
fill_val (scalar, default None) – The value to use where there is no valid group value to propagate backward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate backward.
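inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.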

Returns:

The returned Dataset contains the input Dataset object’s numerical columns.

Return type:

Dataset

See also

GroupBy.fill_forward: Replace NaN and invalid array values with the last valid group value.
Categorical.fill_backward: Replace NaN and invalid array values with the next valid group value.
riptable.fill_backward: Replace NaN and invalid values with the next valid value.
Dataset.fillna: Replace NaN and invalid values with a specified value or nearby data.
FastArray.fillna: Replace NaN and invalid values with a specified value or nearby data.
FastArray.replacena: Replace NaN and invalid values with a specified value.

Examples

>>> ds = rt.Dataset({'Key_col' : ['A', 'B', 'A', 'B', 'A', 'B'],
...             'Vals' : [rt.nan, rt.nan, 2, 3, 4, 5]})
>>> ds.gb('Key_col').fill_backward()
#   Vals
-   ----
0   2.00
1   3.00
2   2.00
3   3.00
4   4.00
5   5.00

Use a fill_val to replace values where there’s no valid group value to propagate backward:

>>> ds.Vals = rt.FastArray([0, 1, 2, 3, rt.nan, rt.nan])
>>> ds.gb('Key_col').fill_backward(fill_val = 0)
#   Vals
-   ----
0   0.00
1   1.00
2   2.00
3   3.00
4   0.00
5   0.00

Replace at most one NaN or invalid value (the one closest to the next valid group value) in any consecutive series of NaN or invalid values in a group:

>>> ds.Vals = rt.FastArray([rt.nan, rt.nan, rt.nan, rt.nan, 4, 5])
>>> ds.gb('Key_col').fill_backward(limit = 1)
#   Vals
-   ----
0    nan
1    nan
2   4.00
3   5.00
4   4.00
5   5.00

fill_forward(limit=0, fill_val=None, inplace=False)

Replace NaN and invalid array values by propagating the last encountered valid group value forward.

Parameters:

limit (int, default 0) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.
fill_val (scalar, default None) – The value to use where there is no valid group value to propagate forward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate forward.
inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.

Returns:

The returned Dataset contains the input Dataset object’s numerical columns.

Return type:

Dataset

See also

GroupBy.fill_backward: Replace NaN and invalid array values with the next valid group value.
Categorical.fill_forward: Replace NaN and invalid array values with the last valid group value.
riptable.fill_forward: Replace NaN and invalid values with the last valid value.
Dataset.fillna: Replace NaN and invalid values with a specified value or nearby data.
FastArray.fillna: Replace NaN and invalid values with a specified value or nearby data.
FastArray.replacena: Replace NaN and invalid values with a specified value.

Examples

>>> ds = rt.Dataset({'Key_col' : ['A', 'B', 'A', 'B', 'A', 'B'],
...                  'Vals' : [0, 1, 2, 3, rt.nan, rt.nan]})
>>> ds.gb('Key_col').fill_forward()
#   Vals
-   ----
0   0.00
1   1.00
2   2.00
3   3.00
4   2.00
5   3.00

Use a fill_val to replace values where there’s no valid group value to propagate forward:

>>> ds.Vals = rt.FastArray([rt.nan, rt.nan, 2, 3, 4, 5])
>>> ds.gb('Key_col').fill_forward(fill_val = 0)
#   Vals
-   ----
0   0.00
1   0.00
2   2.00
3   3.00
4   4.00
5   5.00

Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values in a group:

>>> ds.Vals = rt.FastArray([0, 1, rt.nan, rt.nan, rt.nan, rt.nan])
>>> ds.gb('Key_col').fill_forward(limit = 1)
#   Vals
-   ----
0   0.00
1   1.00
2   0.00
3   1.00
4    nan
5    nan
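
The interaction of grouping, limit, and fill_val can be modeled with a short plain-Python sketch. It is an illustration of the documented behavior, not riptable's implementation: the function names are hypothetical, NaN stands in for all invalid values, and it assumes fill_val applies only where a group has no earlier valid value. Backward fill is modeled as the same scan over reversed rows.

```python
import math

NAN = float("nan")


def fill_forward_by_group(keys, values, limit=0, fill_val=None):
    """Plain-Python model of a per-group forward fill (limit=0 means unlimited)."""
    out = list(values)
    last_valid = {}  # group key -> most recent valid value in that group
    filled = {}      # group key -> NaNs seen since that valid value
    for i, (key, val) in enumerate(zip(keys, values)):
        if not math.isnan(val):
            last_valid[key] = val
            filled[key] = 0
        elif key in last_valid:
            filled[key] += 1
            if limit == 0 or filled[key] <= limit:
                out[i] = last_valid[key]
        elif fill_val is not None:
            # Assumption: fill_val is used only when the group has no earlier valid value.
            out[i] = fill_val
    return out


def fill_backward_by_group(keys, values, limit=0, fill_val=None):
    """Backward fill modeled as a forward fill over the reversed rows."""
    return fill_forward_by_group(keys[::-1], values[::-1], limit, fill_val)[::-1]


keys = ["A", "B", "A", "B", "A", "B"]
forward = fill_forward_by_group(keys, [0, 1, NAN, NAN, NAN, NAN], limit=1)
backward = fill_backward_by_group(keys, [NAN, NAN, NAN, NAN, 4, 5], limit=1)
print(forward)   # [0, 1, 0, 1, nan, nan]
print(backward)  # [nan, nan, 4, 5, 4, 5]
```

With limit=1, each group fills only the NaN adjacent to its valid value, which matches the limit examples shown for fill_forward and fill_backward.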

get_group(category, **kwargs)

Get the group with the given name as a Dataset.

Parameters:

category (string or tuple) – A value from the column used to construct the GroupBy, or if multiple columns were used, a tuple of values, one from each column.

Return type:

Dataset

Example

>>> ds.groupby('symbol').get_group('AAPL')

nth(n=1)

Select the nth row from each group.

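Parameters:

n (int) – The position of the row to select within each group. Positions are zero-based, and negative values count from the end of the group.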

Examples

>>> ds = rt.Dataset({'A': [1, 1, 2, 1, 2],
...                  'B': [np.nan, 2, 3, 4, 5]})
>>> g = ds.groupby('A')
>>> g.nth(0)
*A      B
--   ----
1    nan
2   3.00

[2 rows x 2 columns] total bytes: 32.0 B

>>> g.nth(1)
*A      B
--   ----
1   2.00
2   5.00

[2 rows x 2 columns] total bytes: 32.0 B

>>> g.nth(-1)
*A      B
--   ----
1   4.00
2   5.00

[2 rows x 2 columns] total bytes: 32.0 B
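
The row selection in the examples above can be reproduced with a small plain-Python model. It is a sketch only: nth_per_group is a hypothetical helper, and skipping groups that have too few rows is an assumption, since that case is not described here.

```python
import math


def nth_per_group(keys, values, n):
    """Return {key: n-th value in that group}, with groups in sorted key order.

    Negative n counts from the end of each group.
    Assumption: groups with too few rows are skipped.
    """
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    result = {}
    for k in sorted(groups):
        rows = groups[k]
        if -len(rows) <= n < len(rows):
            result[k] = rows[n]
    return result


A = [1, 1, 2, 1, 2]
B = [float("nan"), 2.0, 3.0, 4.0, 5.0]
first = nth_per_group(A, B, 0)
second = nth_per_group(A, B, 1)
last = nth_per_group(A, B, -1)
print(second)  # {1: 2.0, 2: 5.0}
print(last)    # {1: 4.0, 2: 5.0}
```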

pad(limit=0, fill_val=None, inplace=False)

Forward fill the values

Parameters:

limit (int, optional) – Limit on how many values to fill.

See also

fill_forward, fill_backward, fill_invalid

abstract stack(**kwargs)

abstract unstack(**kwargs)
