riptable.rt_grouping

Classes

Grouping

Every GroupBy and Categorical object holds a grouping in self.grouping.

Functions

combine2groups(group_row, group_col[, filter, showfilter])

The group_row unique keys are used in the grouping_dict returned.

hstack_groupings(ikey, uniques[, i_cutoffs, ...])

For hstacking Categoricals or fixing indices in a categorical from a stacked .sds load.

hstack_test(arr_list)

merge_cats(indices, listcats[, idx_cutoffs, ...])

For hstacking Categoricals possibly from a stacked .sds load.

class riptable.rt_grouping.Grouping(grouping, categories=None, ordered=None, dtype=None, base_index=1, sort_display=False, filter=None, lex=False, rec=False, categorical=False, cutoffs=None, next=False, unicode=False, name=None, hint_size=0, hash_mode=2, _trusted=False, verbose=False)

Every GroupBy and Categorical object holds a grouping in self.grouping; this class informs the groupby algorithms how to group the data.

Stage 1

Initializing from a GroupBy object or unbinned Categorical object:

  • grouping_dict: dictionary of non-unique key columns (hash will be performed)

  • iKey: array size is same as multikey, the unique key to which each row in the multikey belongs

  • iFirstKey: array size is same as unique keys, index into the first row for that unique key

  • iNextKey: array size is same as multikey, index to the next row that hashed to same value

  • nCountGroup: array size is same as unique keys, for each unique item, how many values

Initializing from a pre-binned Categorical object:

  • grouping_dict: dictionary of pre-binned columns (no hash performed)

  • iKey: array size is same as Categorical’s underlying index array - often uses the same array.

  • unique_count: unique number of items in the categorical.

Stage 2

  • iGroup: unique keys are grouped together

  • iFirstGroup: index into first row for the group

  • nCountGroup: number of items in the group
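
The stage-1 and stage-2 arrays described above can be sketched in plain numpy (an illustrative model only; riptable computes these via hashing in its C++ core, and hash order of first appearance differs from np.unique's sorted order):

```python
import numpy as np

def stage1(keys):
    # Factorize one key column: uniques plus a 0-based bin per row.
    # NOTE: np.unique sorts its uniques; riptable's hash keeps first-appearance order.
    uniques, ikey0 = np.unique(keys, return_inverse=True)
    ikey = ikey0 + 1                                   # base-1 iKey; bin 0 = filtered rows
    ifirstkey = np.array([np.flatnonzero(ikey0 == u)[0] for u in range(len(uniques))])
    ncountkey = np.bincount(ikey0)                     # occurrences of each unique key
    return uniques, ikey, ifirstkey, ncountkey

def stage2(ikey0, unique_count):
    # Pack: a fancy index (iGroup) that makes each group's rows contiguous,
    # plus per-group offsets (iFirstGroup) and sizes (nCountGroup).
    igroup = np.argsort(ikey0, kind='stable')
    ncountgroup = np.bincount(ikey0, minlength=unique_count)
    ifirstgroup = np.concatenate(([0], np.cumsum(ncountgroup)[:-1]))
    return igroup, ifirstgroup, ncountgroup
```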

Performing calculations

(See Grouping._calculate_all)

Parameters:
  • origdict –

  • funcNum –

  • func_param –

Steps:

1. Check the keywords for "invalid" (whether or not an invalid bin will be included in the result table).

2. Check the keywords for a filter and store it.

3. Call pack_by_group.

   pack_by_group(filter=None, mustrepack=False)

   1. If the grouping object has already been packed and no filter is present, return.

   2. If a filter is present, discard any existing iNextKey and combine the filter with the iKey.

   3. Call the groupbypack routine -> sends info to CPP.

   4. iGroup, iFirstGroup, nCountGroup are returned and stored.

4. Prepare the origdict for calculation.

   _get_calculate_dict(origdict, funcNum, func=None, func_param=0)

   1. Check for "col_idx" in keywords (used for agg function mapping certain operations to certain columns).

   2. The grouping object has a _grouping_dict (keys). If these columns are in origdict, they are removed.

   3. Most operations cannot be performed on strings or string-based categoricals. Remove columns of those types.

   4. Return the cleaned-up dictionary and a list of its columns. (npdict, values)

5. Perform the operation.

   • rc.EmaAll32 - for cumsum, cumprod, ema_decay, etc.

   • _groupbycalculateall - for basic functions that don't require packing (combine filter if one exists)

   • _groupbycalculateallpack - for level 2 functions that require packing

6. Collect the results. accum_tuple is a series of columns after the operation; the data has not been sorted. accum_tuple has an invalid item at [0] for each column - if no invalid was requested, trim it off. Store the columns in a list. If the function was called from accum2, return here.

7. Make a dataset from the dictionary of calculated columns.

   _make_accum_dataset

   1. Make a dictionary from the list of calculated columns. Use the names from npdict (see step 4).

   2. If nothing was calculated for a column, its value will be None. Remove it.

   3. If the column was a categorical, the calculate dict only has its indices. Pull the categories from the original dictionary and build a new categorical (shallow copy).

   _return_dataset

   1. If the function is in cumsum, cumprod, ema_decay, etc., no groupby keys will appear (set to None).

   2. If the function is count, it will have a single column (Count) - build a dataset from this.

   3. Initialize an empty dictionary (newdict).

   4. Iterate over the column names in the original dictionary and copy them to newdict. accumdict only contains columns that were operated on; if the return_all flag was set to True, the other columns still need to be included.

   5. If the function is in cumsum, cumprod, ema_decay, etc., no sort will be applied and no labels (gbkeys) will be tagged.

   6. Otherwise, apply a sort (the default for GroupBy) to each column with isortrows (from the GroupByKeys object). Tag all label columns in the final dataset.

8. Return the dataset.

property _anydict

Either the _grouping_dict or the _grouping_unique_dict. Only used for names and array datatypes. Checks for and returns _grouping_dict first.

property all_unique: bool

Indicates whether all keys/groups occur exactly once.

property base_index: int

The starting index from which keys (valid groups) are numbered. Always equal to 0 or 1.

property catinstance

Integer array for constructing Categorical or Categorical-like array.

Returns:

instance_array – If base index is 1, returns the ikey. If base index is 0, stores and returns ikey - 1. If in enum mode, returns integers from _grouping_dict.

Return type:

FastArray

property gbkeys: Mapping[str, numpy.ndarray]
property ifirstgroup

Returns a sister array used with ncountgroup and igroup.

Returns:

ifirstgroup

Return type:

np.ndarray of int

property ifirstkey

Returns the row locations of the first member of each group.

Returns:

ifirstkey

Return type:

np.ndarray of int

property igroup

Returns a fancy index that, when applied, makes all the groups contiguous (packed together).

Returns:

igroup

Return type:

np.ndarray of int

property igroupreverse

Returns the fancy index to reverse the shuffle from igroup.

Returns:

igroupreverse

Return type:

np.ndarray of int

See also

igroup

property ikey

Returns a 1-based integer array with the bin number for each row.

Bin 0 is reserved for filtered-out rows. For a base-0 grouping, this property returns the underlying index plus 1.

Returns:

ikey

Return type:

np.ndarray of int

property ilastkey

Returns the row locations of the last member of each group.

Returns:

ilastkey

Return type:

np.ndarray of int

property inextkey

Returns the row locations of the next member of the group (or invalid int).

Returns:

inextkey

Return type:

np.ndarray of int

property iprevkey

Returns the row locations of the previous member of the group (or invalid int).

Returns:

iprevkey

Return type:

np.ndarray of int

property iscategorical: bool

True if only uniques are being held - no reference to original data.

property isdirty: bool

If True, it is possible that not all of the values between 0 and the unique count appear in the iKey: the number of unique values that occur may differ from the number of possible unique values (e.g., after slicing a Categorical). Defaults to False.

property isdisplaysorted: bool
property isenum: bool
property ismultikey: bool

True if unique dict holds multiple arrays. False if unique dict holds single array or in enum mode.

property isordered: bool
property isortrows
property issinglekey: bool

True if unique dict holds single array. False if unique dict holds multiple arrays or in enum mode.

property ncountgroup

Returns a sister array used with ifirstgroup and igroup.

Returns:

ncountgroup

Return type:

np.ndarray of int

property ncountkey
Returns:

ncountkey – An array with the number of occurrences of each unique key. Includes the zero bin.

Return type:

np.ndarray of int

property packed: bool

True if an operation that requires packing (e.g. median) has been performed. If packed, iGroup, iFirstGroup, and nCountGroup have been generated.

property unique_count: int

Number of unique groups.

property uniquedict: Mapping[str, numpy.ndarray]

Dictionary of key names -> array(s) of unique categories.

GroupBy will pull values from non-unique dictionary using iFirstKey. Categorical already holds a unique dictionary. Enums will pull with iFirstKey, and return unique strings after translating integer codes.

Returns:

Dictionary of key names -> array(s) of unique categories.

Return type:

dict

Notes

No sort is applied here.
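
As a sketch of the GroupBy path described above (illustrative numpy; the helper name is invented here), pulling uniques out of the non-unique key columns with iFirstKey amounts to a single fancy-index per column:

```python
import numpy as np

def uniquedict_sketch(grouping_dict, ifirstkey):
    # Index each non-unique key column by iFirstKey to get one representative
    # value (the first occurrence) per group. No sort is applied.
    return {name: col[ifirstkey] for name, col in grouping_dict.items()}
```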

property uniquelist

See Grouping.uniquedict. Sets FastArray names as key names.

DebugMode = False
GroupingInit
REGISTERED_REVERSE_TABLES = []
__getitem__(fld)

Perform an indexing / slice operation on iKey, _catinstance, _grouping_dict if they have been set.

Parameters:

fld – Supported: a slice, an integer array (fancy index), or a boolean array (True/False mask). A single integer, a string, or a list of strings raises an error.

Returns:

newgroup – A copy of the grouping object with a reindexed iKey (the dirty flag in the result will be set to True); a single scalar value (for enum/single-key grouping); or a tuple of scalar values (for multikey grouping).

Return type:

Grouping or scalar or tuple

__repr__()

Return repr(self).

_build_unique_dict(grouping)

Pull values from the non-unique grouping dict using the iFirstKey index. If enumstring is True, translate enum codes to their strings.

_calculate_all(origdict, funcNum, func_param=0, keychain=None, user_args=(), tups=0, accum2=False, return_all=False, **kwargs)

All groupby calculations from GroupBy, Categorical, Accum2, and some groupbyops enter through this method.

Parameters:
  • origdict

  • funcNum

  • func_param (int) – parameters from GroupByOps (often simple scalars)

  • keychain – Optional groupby keys to apply to the final dataset at the end.

  • user_args – A tuple of zero or more arguments to pass to userfunc. user_args only exists for apply*-related function calls.

  • tups (int, 0) – Defaults to 0. 1 if user functions had tuples () indicating to pass in all arrays. tups is only > 0 for apply* related function calls where the first parameter was (arr1, ..)

  • accum2 (bool) –

  • return_all (bool) –

  • showfilter (bool, optional) – If set will calculate contents in the 0 bin.

See also

Grouping

_empty_allowed(funcNum)

Operations like cumcount do not need an origdict to calculate; calculations are made only on binned columns. More such operations may be added later, so this check is kept here.

_finalize_dataset(accumdict, keychain, gbkeys, transform=False, showfilter=False, addkeys=False, **kwargs)

Possibly transform, possibly reattach keys, possibly sort. (TODO: move transform to here.)

Parameters:
  • accumdict (dict or Dataset) –

  • keychain

  • gbkeys – may be passed as None

_from_categories(grouping, categories, arr_len, base_index, filter, dtype, ordered, _trusted)

Initialize a Grouping object from pre-defined uniques.

Parameters:
  • grouping (dict of single array) – Pre-defined iKey or non-unique values.

  • categories (dict of arrays) – Pre-defined dictionary of unique categories or enum mapping (not implemented)

  • arr_len (int) – Length of arrays in categories dict.

  • filter (boolean array) – Pre-filter the same length as the non-unique values.

  • _trusted (bool) – If True, data will not be validated with min / max check.

Returns:

  • ikey (ndarray of ints) – Base 0 or Base 1 ikey

  • ordered_flag (bool) – Flag indicating whether the categories were/are ordered. This is the ordered flag just being passed through.

_get_calculate_dict(origdict, funcNum, func=None, return_all=False, computable=True, func_param=0, **kwargs)

Builds a dictionary to perform the groupby calculation on.

If string/string-like columns cannot be computed, they will not be included. If specific columns have been specified (in col_idx, see GroupBy.agg), only they will be included.

Returns:

  • npdict (dict) – Final dictionary for calculation.

  • values (list) – List of columns in npdict. (NOTE: this is repetitive as npdict has these values also.)

static _hstack(glist, _trusted=False, base_index=1, ordered=False, destroy=False)

‘hstack’ operation for Grouping instances.

Parameters:
  • glist (list of Grouping) – A list of Grouping objects.

  • _trusted (bool) – Indicates whether we need to validate the data in the supplied Grouping instances for consistency / correctness before using it. In certain cases, the caller knows the data is safe to use directly (e.g. because they’ve just created it), so the validation can be skipped.

  • base_index (int) – The base index to use for the resulting Categorical.

  • ordered (bool) – Indicates whether the resulting Categorical will be an ‘ordered’ Categorical (sometimes called an ‘Ordinal’).

  • destroy (bool) – This parameter is unused.

Return type:

Grouping

_make_accum_dataset(origdict, npdict, accum, funcNum, return_all=False, keychain=None, **kwargs)

Returns a Dataset

_make_enumikey(list_values, filter=None)

Internal routine to lazily generate the ikey for an enum. If a filter is passed on init, the ikey has to be generated upfront.

Also generates ifirstkey and unique_count.

_make_isortrows(gbkeys)

Sort a single or multikey dictionary of unique values. Return the sorted index.

_return_dataset(origdict, accumdict, func_num, return_all=False, col_idx=None, keychain=None, **kwargs)
_set_anydict(d)

Replace the dict returned by _anydict Will check for and set _grouping_dict first.

_set_newinstance(newinstance)
apply(origdict, userfunc, *args, tups=0, filter=None, label_keys=None, return_all=False, **kwargs)

Grouping apply (for Categorical, groupby, accum2). Apply the function userfunc group-wise and combine the results together. userfunc will be called back once per group. The order of the groups is either:

  • Order of first appearance (when coming from a hash)

  • Lexicographical order (when lex=True or a Categorical with ordered=True)

If a group from a categorical has no rows (an empty group), a dataset with one row of invalids (as a placeholder) will be used and userfunc will still be called.

The function passed to apply must take a Dataset as its first argument and return one of the following:

  • a Dataset (with one or more rows returned)

  • a dictionary of name:array pairs

  • a single array

The set of returned columns must be consistent for each input (group) dataset. apply will then take care of combining the results back together into a Dataset with the groupby key(s) in the initial column(s). apply is therefore a highly flexible grouping method.

While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods. riptable offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.

Parameters:
  • userfunc (callable) – A callable that takes a Dataset as its first argument, and returns a Dataset, dict, or single array. In addition the callable may take positional and keyword arguments.

  • args (tuple) – Optional positional arguments to pass to userfunc

  • kwargs (dict) – Optional keyword arguments to pass to userfunc

Returns:

Two results are possible:

  • a Dataset that is grouped by (reduced from the original dataset)

  • a Dataset of the original length (not grouped by)

Examples

>>> ds = rt.Dataset({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
>>> g = rt.GroupBy(ds, 'A')

From ds above we can see that g has two groups, a and b. Calling apply in various ways, we can get different grouping results.

Example 1: below the function passed to apply takes a Dataset as its argument and returns a Dataset or dictionary with one row for each row in each group. apply combines the result for each group together into a new Dataset:

>>> g.apply(lambda x: x.sum())
*A   B    C
--   -   --
a    3   10
b    3    5
>>> g.apply(lambda x: {'B':x.B.sum()})
*A   B
--   -
a    3
b    3

Example 2: The function passed to apply takes a Dataset as its argument and returns a Dataset with one row per group. apply combines the result for each group together into a new Dataset:

>>> g.apply(lambda x: x.max() - x.min())
*A   B   C
--   -   -
a    1   2
b    0   0

Example 3: The function passed to apply takes a Dataset as its argument and returns a Dataset with one row and one column per group (i.e., a scalar). apply combines the result for each group together into a Dataset:

>>> g.apply(lambda x: rt.Dataset({'val': [x.C.max() - x.B.min()]}))
*A   val
--   ---
a      5
b      2

Example 4: The function returns a Dataset with more than one row.

>>> g.apply(lambda x: x.cumsum())
*A   B    C
--   -   --
a    1    4
a    3   10
b    3    5

Example 5: A non-lambda, user-supplied function which creates a new column in the existing Dataset.

>>> def userfunc(x):
...     x.Sub = x.C - x.B
...     return x
>>> g.apply(userfunc)
*A   B   C   Sub
--   -   -   ---
a    1   4     3
a    2   6     4
b    3   5     2
apply_helper(isreduce, origdict, userfunc, *args, tups=0, filter=None, showfilter=False, label_keys=None, func_param=None, dtype=None, badrows=None, badcols=None, computable=True, **kwargs)

Grouping apply_reduce/apply_nonreduce (for Categorical, groupby, accum2)

For every column of data to be computed:
The userfunc will be called back per group as a single array. The order of the groups is either:
  1. Order of first appearance (when coming from a hash)

  2. Lexicographical order (when lex=True or a Categorical with ordered=True)

A reduce function must take an array as its first argument and return back a single scalar value. A non-reduce function must take an array as its first argument and return back another array. The first argument to apply MUST be the callable user function.

The second argument to apply contains one or more arrays to operate on.

  • If passed as a list, the userfunc is called for each array in the list

  • If passed as a tuple, the userfunc is called once with all the arrays as parameters

Parameters:
  • isreduce (bool) – Must be set. True for reduce, False for non-reduce.

  • origdict (dict of name:arrays) – The column names and arrays to apply the function on.

  • userfunc (callable) – A callable that takes one or more arrays as its first argument, and returns an array or scalar. If isreduce is True, userfunc is a reduction and should return a scalar; when isreduce is False, userfunc is a nonreduce/scan/prefix-sum and should return an array. In addition the callable may take positional arguments and keyword arguments.

  • *args – Any additional user arguments to pass to userfunc.

  • tups (int) – Defaults to 0. Set to 1 if userfunc wants multiple arrays passed, fixed up by iGroup. Set to 2 for passing in constants.

  • showfilter (bool) – Set to True to calculate filter. Defaults to False.

  • filter (ndarray of bools) – optional boolean filter to apply

  • label_keys (rt.GroupByKeys) – The labels on the left.

  • func_param (tuple, optional) – Caller may pass func_param=(arg1, arg2) to pass arguments to userfunc.

  • dtype (str or np.dtype, or dict of np.dtypes, optional) – Explicitly specify the dtype for the output array. Defaults to None, which means the function chooses a compatible dtype for the output. If a dict of np.dtypes is passed, multiple output arrays are allocated based on the specified dtypes.

  • badrows – Not used; may be passed from Accum2.

  • badcols – not used

Notes

All other arguments passed to this function (if any remaining) will be passed through to userfunc.

See also

GroupByOps.apply_reduce, GroupByOps.apply_nonreduce

as_filter(index)

Returns an index filter for a given unique key

copy(deep=True)

Create a shallow or deep copy of the grouping object.

Parameters:

deep (bool, default True) – If True, makes a deep copy of all array data.

Returns:

  • newgrouping (Grouping)

  • Note: a shallow copy will always make new dictionaries, but does not copy array data.

copy_from(other=None)

Initializes a new Grouping object if other is None. Otherwise, shallow-copies all necessary attributes from the other grouping object to self.

Parameters:

other (Grouping) –

count(gbkeys=None, isortrows=None, keychain=None, filter=None, transform=False, **kwargs)

Compute the count of each unique key. Returns a dataset containing a single column. The Grouping object can generate this column on its own, so it skips straight to _return_dataset instead of passing through _calculate_all like other groupby calculations.

static extract_groups(condition, grouped_data, ncountgroup, ifirstgroup)

Take groups of elements from an array, where the groups are selected by a boolean mask.

This function provides boolean-indexing over groups of data – so a boolean mask can be used to select _groups_ of data, rather than just individual elements, and the grouped elements will be copied to the output.

Parameters:
  • condition (np.ndarray of bool) – An array whose nonzero or True entries indicate the groups in ncountgroup whose elements will be extracted from grouped_data.

  • grouped_data (np.ndarray) –

  • ncountgroup (np.ndarray of int) –

  • ifirstgroup (np.ndarray of int) –

Return type:

np.ndarray

Raises:

ValueError – When condition is not a boolean/logical array. When condition and ncountgroup have different shapes. When ncountgroup and ifirstgroup have different shapes.

See also

numpy.extract

Examples

Select data from an array, where the elements belong to even-numbered groups within the Grouping object.

>>> key_data = rt.FA([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6])
>>> data = rt.arange(len(key_data))
>>> g = rt.Grouping(key_data)
>>> group_mask = rt.arange(len(g.ncountgroup)) % 2 == 0
>>> Grouping.extract_groups(group_mask, data, g.ncountgroup, g.ifirstgroup)
FastArray([1, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 20])
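
The semantics can be mimicked in plain numpy (a slow, illustrative sketch; riptable performs this in its vectorized C extension):

```python
import numpy as np

def extract_groups_sketch(condition, grouped_data, ncountgroup, ifirstgroup):
    # Copy out the elements of every group whose entry in the mask is True.
    if condition.dtype != np.bool_:
        raise ValueError("condition must be a boolean array")
    pieces = [grouped_data[first:first + count]
              for keep, first, count in zip(condition, ifirstgroup, ncountgroup)
              if keep]
    return np.concatenate(pieces) if pieces else grouped_data[:0]
```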
get_name()

List of grouping or grouping unique dict keys.

isin(values)

Used to match values.

Return type:

np.ndarray of bool, True where the values are found

See also

rt.Grouping.ismember

ismember(values, reverse=False)

Used to match against the unique categories. NOTE: this does not match against the entire array, just the uniques.

Parameters:

reverse (bool, defaults to False.) – Set to True to reverse the ismember(A, B) to ismember(B,A).

Returns:

  • member_mask (np.ndarray of bool) – boolean array of matches to unique categories

  • member_indices (np.ndarray of int) – fancy index array of location in unique categories

Examples

>>> a = rt.Cat(['b','c','d']).tile(5)
>>> b = rt.Cat(['a','b','d','e','f']).tile(5)
>>> tf1 = rt.ismember(a, b)[0]
>>> tf2 = b.grouping.ismember(a.categories())[1][b-1] != -128
>>> np.all(tf1 == tf2)
True
>>> a = rt.Cat(['BABL','COKE','DELT']).tile(50000)
>>> b = rt.Cat(['AAPL','BABL','DELT','ECHO','FB']).tile(33333333)
>>> %time tf1 = rt.ismember(a,b)[0]
 197 ms
>>> %time tf3 = rt.ismember(a.category_array, b.category_array)[1][a-1] != -128
 1 ms
>>> np.all(tf1 == tf3)
True

See also

rt.Grouping.isin, rt.ismember

classmethod newclassfrominstance(instance, origin)
newgroupfrominstance(newinstance)

calculate_all may change the instance

Parameters:

newinstance (integer based array (codes or bins)) –

Return type:

a new grouping object

onedict(unicode=False, invalid=True, sep='_')

Concatenates multikey groupings with the separator (default underscore) to make a single key. Adds 'Inv' as the first element if the invalid keyword is True.

Parameters:
  • unicode (boolean, default False) – whether to create a string or unicode based array

  • invalid (boolean, default True) – whether or not to add ‘Inv’ as the first unique

Returns:

  • a string of the new key name

  • a new single array of the uniques concatenated
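
A minimal sketch of the concatenation (illustrative Python; the real method also handles bytes vs. unicode and riptable's own invalid handling):

```python
import numpy as np

def onedict_sketch(unique_dict, sep='_', invalid=True):
    # Join the parallel unique arrays of a multikey into one string per group,
    # and join the key names the same way for the new column name.
    cols = [np.asarray(c).astype(str) for c in unique_dict.values()]
    combined = np.array([sep.join(parts) for parts in zip(*cols)])
    name = sep.join(unique_dict.keys())
    if invalid:
        combined = np.concatenate((np.array(['Inv']), combined))
    return name, combined
```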

pack_by_group(filter=None, mustrepack=False)

Used to prepare data for custom functions

Prepares 3 arrays:

  • iGroup: array size is same as multikey, unique keys are grouped together

  • iFirstGroup: array size is number of unique keys for that group, indexes into isort

  • nCountGroup: array size is number of unique keys for the group

Users should access these via the igroup, ifirstgroup, and ncountgroup properties.

If a filter is passed, it is remembered

classmethod possibly_recast(arr, unique_count, dtype=None)

unique_count is checked and the preferred (minimal) dtype size is calculated from it.

If a dtype has been provided, it will be used (only if it is large enough to fit the maximum value for the calculated dtype).

Parameters:
  • arr (ndarray of ints) –

  • unique_count (int) – The number of unique bins corresponding to arr.

  • dtype (str or np.dtype, optional) – Optionally force a dtype for the returned integer array (see dtype keyword in the Categorical constructor), defaults to None.

Returns:

new_arr – A recast array with a smaller dtype, the requested dtype, or possibly the same array as arr if no changes were needed.

Return type:

ndarray of ints
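
The dtype-selection logic can be sketched as follows (an approximation of the rules above; the exact headroom riptable reserves for sentinel/invalid values is an assumption here):

```python
import numpy as np

def minimal_int_dtype(unique_count):
    # Smallest signed integer dtype whose max can represent unique_count bins
    # (strictly less than max, leaving room for sentinel values).
    for dt in (np.int8, np.int16, np.int32):
        if unique_count < np.iinfo(dt).max:
            return np.dtype(dt)
    return np.dtype(np.int64)

def possibly_recast_sketch(arr, unique_count, dtype=None):
    preferred = minimal_int_dtype(unique_count)
    if dtype is not None and np.dtype(dtype).itemsize >= preferred.itemsize:
        preferred = np.dtype(dtype)          # honor the request if it is large enough
    return arr if arr.dtype == preferred else arr.astype(preferred)
```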

classmethod register_functions(functable)
regroup(filter=None, ikey=None)

Regenerate the groupings iKey, possibly with a filter and/or eliminating unique values.

Parameters:
  • filter (np.ndarray of bool, optional) – Filtered bins will be marked as zero in the resulting iKey. If not provided, uniques will be reduced to the ones that occur in the iKey.

  • ikey (np.ndarray of int, optional) – Only used when the grouping is in enum mode.

Returns:

New Grouping object created by regenerating the ikey, ifirstkey, and unique_count using data from this instance.

Return type:

Grouping

set_dirty()

If the shared information (like a Categorical’s instance array) has been changed outside of the grouping object, the changing routine can call this on the grouping object.

set_name(name)

If the grouping dict contains a single item, rename it.

This will make categorical results consistent with groupby results if they’ve been constructed before being added to a dataset. Ensures that label names are consistent with categorical names.

Parameters:

name (str) – The new name to use for the single column in the internal grouping dictionary.

Examples

Single key Categorical added to a Dataset, grouping picks up name:

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> print(c.get_name())
None
>>> ds = rt.Dataset({'catcol':c})
>>> ds.catcol.sum(rt.arange(5))
*catcol   col_0
-------   -----
a             5
b             2
c             3

Multikey Categorical, no names:

>>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.FA([1,1,2,3,1])])
>>> print(c.get_name())
None
>>> ds = rt.Dataset({'mkcol': c})
>>> ds.mkcol.sum(rt.arange(5))
*mkcol_0   *mkcol_1   col_0
--------   --------   -----
a                 1       5
b                 2       2
c                 3       3

Multikey Categorical, already has names for its columns (names are preserved):

>>> arr1 = rt.FA(['a','a','b','c','a'])
>>> arr1.set_name('mystrings')
>>> arr2 = rt.FA([1,1,2,3,1])
>>> arr2.set_name('myints')
>>> c = rt.Categorical([arr1, arr2])
>>> ds = rt.Dataset({'mkcol': c})
>>> ds.mkcol.sum(rt.arange(5))
*mystrings   *myints   col_0
----------   -------   -----
a                  1       5
b                  2       2
c                  3       3
shrink(newcats, misc=None, inplace=False, name=None)
Parameters:
  • newcats (array_like) – New categories to replace the old - typically a reduced set of strings

  • misc (scalar, optional) – Value to use as category for items not found in new categories. This will be added to the new categories. If not provided, all items not found will be set to a filtered bin.

  • inplace (bool, not implemented) – If True, re-index the categorical’s underlying FastArray. Otherwise, return a new categorical with a new index and grouping object.

  • name

Returns:

A new Grouping object based on this instance’s data and the new set of labels provided in newcats.

Return type:

Grouping

sort(keylist)
static take_groups(grouped_data, indices, ncountgroup, ifirstgroup)

Take groups of elements from an array.

This function provides fancy-indexing over groups of data – so a fancy index can be used to specify _groups_ of data, rather than just individual elements, and the grouped elements will be copied to the output.

Parameters:
  • grouped_data (np.ndarray) –

  • indices (np.ndarray of int) –

  • ncountgroup (np.ndarray of int) –

  • ifirstgroup (np.ndarray of int) –

Return type:

np.ndarray

Raises:

ValueError – When ncountgroup and ifirstgroup have different shapes.

See also

numpy.take

Examples

Select data from an array, where the elements belong to the 2nd, 4th, and 6th groups within the Grouping object.

>>> key_data = rt.FA([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6])
>>> data = rt.arange(len(key_data))
>>> g = rt.Grouping(key_data)
>>> group_indices = rt.FA([2, 4, 6])
>>> Grouping.take_groups(data, group_indices, g.ncountgroup, g.ifirstgroup)
FastArray([1, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 20])
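
Equivalently, in illustrative numpy (the real implementation is vectorized):

```python
import numpy as np

def take_groups_sketch(grouped_data, indices, ncountgroup, ifirstgroup):
    # Gather the elements of each requested group, in the order given.
    pieces = [grouped_data[ifirstgroup[i]:ifirstgroup[i] + ncountgroup[i]]
              for i in indices]
    return np.concatenate(pieces) if pieces else grouped_data[:0]
```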
riptable.rt_grouping.combine2groups(group_row, group_col, filter=None, showfilter=False)

The group_row unique keys are used in the grouping_dict returned. The group_col unique keys are expected to become columns.

Parameters:
  • group_row (Grouping) – Grouping object for the rows

  • group_col (Grouping) – Grouping object for the cols

  • filter (np.ndarray of bool, optional) – A boolean filter of values to remove on the rows. Should be same length as group_row.ikey array (can pass in None).

  • showfilter (bool) –

Returns:

A new Grouping object. The new ikey will always have (group_row.unique_count+1)*(group_col.unique_count+1) bins. The grouping_dict in the Grouping object will be for the rows only.

Return type:

Grouping
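
One way to read the bin count above is a row-major layout over (row bin, col bin) pairs, including the 0 (filtered) bin on each axis. The formula below is an assumption for illustration, not confirmed riptable internals:

```python
import numpy as np

def combined_ikey_sketch(row_ikey, col_ikey, col_unique_count):
    # Hypothetical layout: bin = row_bin * (col_unique_count + 1) + col_bin,
    # giving (row_unique_count + 1) * (col_unique_count + 1) bins in total.
    return row_ikey * (col_unique_count + 1) + col_ikey
```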

riptable.rt_grouping.hstack_groupings(ikey, uniques, i_cutoffs=None, u_cutoffs=None, from_mapping=False, base_index=1, ordered=False, verbose=False)

For hstacking Categoricals or fixing indices in a categorical from a stacked .sds load. Supports Categoricals from a single array or a dictionary mapping.

Parameters:
  • ikey (single stacked array or list of indices) – if a single array, needs i_cutoffs for slicing

  • uniques (list of stacked unique category arrays (needs u_cutoffs)) – or list of lists of uniques

  • i_cutoffs

  • u_cutoffs

  • from_mapping (bool) –

  • base_index (int) –

  • ordered (bool) –

  • verbose (bool) –

Returns:

  • list or array_like – list of fixed indices, or array of fixed contiguous indices.

  • list of ndarray – stacked unique values

riptable.rt_grouping.hstack_test(arr_list)
riptable.rt_grouping.merge_cats(indices, listcats, idx_cutoffs=None, unique_cutoffs=None, from_mapping=False, stack=True, base_index=1, ordered=False, verbose=False)

For hstacking Categoricals possibly from a stacked .sds load.

Supports Categoricals from single array or dictionary mapping.

Parameters:
  • indices (single stacked array or list of indices) – if single array, needs idx_cutoffs for slicing

  • listcats (list of stacked unique category arrays (needs unique_cutoffs)) – or list of lists of uniques. If the uniques in file1 are 'A','C' and the uniques in file2 are 'B','C','D', then listcats is [FastArray('A','C','B','C','D')]

  • idx_cutoffs (ndarray of int64, optional) – int64 array of the cutoffs to the indices. if the index length is 30 and 20 the idx_cutoffs is [30,50]

  • unique_cutoffs (list of one int64 array of the cutoffs to the listcats) – if the unique lengths are 2 and 3, unique_cutoffs is [2,5]

  • from_mapping (bool) –

  • stack (bool) –

  • base_index (int) –

  • ordered (bool) –

  • verbose (bool) –

Returns:

Two items are returned:

  • list of fixed indices, or array of fixed contiguous indices.

  • stacked unique values.

Notes

TODO: Needs to support multikey cats.
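
The index-fixing idea can be sketched for single-key categories (an illustrative, quadratic Python sketch; the real routine works on stacked arrays with cutoffs and is far faster):

```python
import numpy as np

def merge_cats_sketch(index_lists, unique_lists):
    # Build one combined unique list and re-map each file's 1-based indices
    # onto it. Bin 0 (the filtered bin) is preserved as 0.
    combined = []
    fixed = []
    for idx, uniques in zip(index_lists, unique_lists):
        remap = np.empty(len(uniques) + 1, dtype=np.int64)
        remap[0] = 0                                  # filtered bin stays 0
        for j, u in enumerate(uniques):
            if u not in combined:
                combined.append(u)
            remap[j + 1] = combined.index(u) + 1      # 1-based slot in merged uniques
        fixed.append(remap[np.asarray(idx)])
    return fixed, np.array(combined)
```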