riptable.rt_grouping

Classes
- Grouping – Every GroupBy and Categorical object holds a grouping in self.grouping; this class informs the groupby algorithms how to group the data.

Functions
- combine2groups – The group_row unique keys are used in the grouping_dict returned.
- hstack_groupings – For hstacking Categoricals or fixing indices in a categorical from a stacked .sds load.
- hstack_test
- merge_cats – For hstacking Categoricals possibly from a stacked .sds load.
- class riptable.rt_grouping.Grouping(grouping, categories=None, ordered=None, dtype=None, base_index=1, sort_display=False, filter=None, lex=False, rec=False, categorical=False, cutoffs=None, next=False, unicode=False, name=None, hint_size=0, hash_mode=2, _trusted=False, verbose=False)
Every GroupBy and Categorical object holds a grouping in self.grouping; this class informs the groupby algorithms how to group the data.
Stage 1
Initializing from a GroupBy object or unbinned Categorical object:
grouping_dict: dictionary of non-unique key columns (a hash will be performed)
iKey: array size is same as the multikey; the unique key (bin) to which each row of the multikey belongs
iFirstKey: array size is same as the unique keys; index of the first row for each unique key
iNextKey: array size is same as the multikey; index of the next row that hashed to the same value
nCountGroup: array size is same as the unique keys; for each unique item, how many values
Initializing from a pre-binned Categorical object:
grouping_dict: dictionary of pre-binned columns (no hash performed)
iKey: array size is same as Categorical’s underlying index array - often uses the same array.
unique_count: unique number of items in the categorical.
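The Stage 1 arrays can be illustrated with a small pure-Python sketch (a simplified stand-in for the C++ hash riptable actually calls; -1 stands in for the invalid sentinel in iNextKey):

```python
def stage1(keys, base_index=1):
    """Sketch of Stage 1: hash non-unique keys into bins, in order of
    first appearance. Returns (ikey, ifirstkey, inextkey, ncountgroup)."""
    bin_of = {}                        # key -> bin number
    ikey, ifirstkey, ncountgroup = [], [], []
    inextkey = [-1] * len(keys)        # -1 stands in for the invalid int
    last_row = {}                      # key -> most recent row seen
    for row, k in enumerate(keys):
        if k not in bin_of:
            bin_of[k] = len(bin_of) + base_index
            ifirstkey.append(row)      # first row for this unique key
            ncountgroup.append(0)
        b = bin_of[k]
        ikey.append(b)
        ncountgroup[b - base_index] += 1
        if k in last_row:              # chain rows that hashed to same key
            inextkey[last_row[k]] = row
        last_row[k] = row
    return ikey, ifirstkey, inextkey, ncountgroup

ikey, ifirstkey, inextkey, ncountgroup = stage1(['a', 'b', 'a', 'c', 'b'])
# ikey:        [1, 2, 1, 3, 2]
# ifirstkey:   [0, 1, 3]
# inextkey:    [2, 4, -1, -1, -1]
# ncountgroup: [2, 2, 1]
```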
Stage 2
iGroup: unique keys are grouped together
iFirstGroup: index into first row for the group
nCountGroup: number of items in the group
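A minimal sketch of this packing step (a stable counting sort over the Stage 1 iKey; base-1 bins assumed, with bin 0 reserved for filtered rows):

```python
def pack(ikey, unique_count, base_index=1):
    """Sketch of Stage 2 packing: group rows of each bin together.

    Returns (igroup, ifirstgroup, ncountgroup). igroup is a fancy index
    that makes groups contiguous; ifirstgroup[i] is the offset of group i
    within igroup; ncountgroup[i] is the size of group i.
    """
    nbins = unique_count + base_index      # bin 0 holds filtered rows
    ncountgroup = [0] * nbins
    for b in ikey:
        ncountgroup[b] += 1
    ifirstgroup, start = [], 0
    for n in ncountgroup:
        ifirstgroup.append(start)
        start += n
    # stable counting sort of row numbers by bin
    cursor = list(ifirstgroup)
    igroup = [0] * len(ikey)
    for row, b in enumerate(ikey):
        igroup[cursor[b]] = row
        cursor[b] += 1
    return igroup, ifirstgroup, ncountgroup

igroup, ifirstgroup, ncountgroup = pack([1, 2, 1, 3, 2], unique_count=3)
# igroup:      [0, 2, 1, 4, 3]   (rows of bin 1, then bin 2, then bin 3)
# ifirstgroup: [0, 0, 2, 4]      (bin 0 is empty)
# ncountgroup: [0, 2, 2, 1]
```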
Performing calculations
(See Grouping._calculate_all)
- Parameters:
origdict –
funcNum –
func_param –
The steps performed are:
1. Check the keywords for "invalid" (whether or not an invalid bin will be included in the result table).
2. Check the keywords for a filter and store it.
3. Call pack_by_group(filter=None, mustrepack=False):
If the grouping object has already been packed and no filter is present, return.
If a filter is present, discard any existing iNextKey and combine the filter with the iKey.
Call the groupbypack routine -> sends info to CPP.
iGroup, iFirstGroup, nCountGroup are returned and stored.
4. Prepare the origdict for calculation with _get_calculate_dict(origdict, funcNum, func=None, func_param=0):
Check for "col_idx" in keywords (used for agg function mapping certain operations to certain columns).
The grouping object has a _grouping_dict (keys). If these columns are in origdict, they are removed.
Most operations cannot be performed on strings or string-based categoricals. Remove columns of those types.
Return the cleaned-up dictionary and a list of its columns. (npdict, values)
5. Perform the operation:
rc.EmaAll32 - for cumsum, cumprod, ema_decay, etc.
_groupbycalculateall - for basic functions that don't require packing (combine filter if exists)
_groupbycalculateallpack - for level 2 functions that require packing
accum_tuple is a series of columns after the operation. The data has not been sorted.
accum_tuple has an invalid item at [0] for each column. If no invalid was requested, trim it off.
Store the columns in a list.
6. If the function was called from accum2, return here.
7. Make a dataset from the dictionary of calculated columns:
_make_accum_dataset
Make a dictionary from the list of calculated columns. Use the names from npdict (see step 4).
If nothing was calculated for a column, the value will be None. Remove it.
If the column was a categorical, the calculate dict only has its indices. Pull the categories from the original dictionary and build a new categorical (shallow copy).
_return_dataset
If the function is in cumsum, cumprod, ema_decay, etc., no groupby keys will appear (set to None).
If the function is count, it will have a single column (Count) - build a dataset from this.
Initialize an empty dictionary (newdict).
Iterate over the column names in the original dictionary and copy them to newdict. accumdict only contains columns that were operated on. If the return_all flag was set to True, these columns still need to be included.
If the function is in cumsum, cumprod, ema_decay, etc., no sort will be applied and no labels (gbkeys) will be tagged.
Otherwise, apply a sort (default for GroupBy) to each column with isortrows (from the GroupByKeys object). Tag all label columns in the final dataset.
8. Return the dataset.
- property _anydict
Either the _grouping_dict or _grouping_unique_dict. Only used for names and array datatypes. Will check for and return _grouping_dict first.
- property base_index: int
The starting index from which keys (valid groups) are numbered. Always equal to 0 or 1.
- property catinstance
Integer array for constructing Categorical or Categorical-like array.
- Returns:
instance_array – If base index is 1, returns the ikey. If base index is 0, stores and returns ikey - 1. If in enum mode, returns integers from _grouping_dict.
- Return type:
FastArray
- property gbkeys: Mapping[str, numpy.ndarray]
- property ifirstgroup
Returns a sister array used with ncountgroup and igroup.
- Returns:
ifirstgroup
- Return type:
np.ndarray of int
See also
- property ifirstkey
Returns the row locations of the first member of each group.
- Returns:
ifirstkey
- Return type:
np.ndarray of int
- property igroup
Returns a fancy index that, when applied, makes all the groups contiguous (packed together).
- Returns:
igroup
- Return type:
np.ndarray of int
See also
- property igroupreverse
Returns the fancy index to reverse the shuffle from igroup.
- Returns:
igroupreverse
- Return type:
np.ndarray of int
See also
- property ikey
Returns a 1-based integer array with the bin number for each row.
Bin 0 is reserved for filtered-out rows. For base-0 groupings, this property returns the underlying index array plus 1.
- Returns:
ikey
- Return type:
np.ndarray of int
- property ilastkey
Returns the row locations of the last member of each group.
- Returns:
ilastkey
- Return type:
np.ndarray of int
- property inextkey
Returns the row locations of the next member of the group (or the invalid int).
- Returns:
inextkey
- Return type:
np.ndarray of int
- property iprevkey
Returns the row locations of the previous member of the group (or the invalid int).
- Returns:
iprevkey
- Return type:
np.ndarray of int
- property isdirty: bool
If True, it is possible that not all of the values between 0 and the unique count appear in the iKey; the number of occurring unique values may differ from the number of possible unique values, e.g. after slicing a Categorical. Default False.
- property ismultikey: bool
True if unique dict holds multiple arrays. False if unique dict holds single array or in enum mode.
- property isortrows
- property issinglekey: bool
True if unique dict holds a single array. False if unique dict holds multiple arrays or in enum mode.
- property ncountgroup
Returns a sister array used with ifirstgroup and igroup.
- Returns:
ncountgroup
- Return type:
np.ndarray of int
See also
- property ncountkey
- Returns: ncountkey – An array with the number of occurrences per unique key.
Includes the zero bin.
- Return type:
np.ndarray of int
- property packed: bool
True if the grouping object has performed an operation that requires packing, e.g. median(). If packed, iGroup, iFirstGroup, and nCountGroup have been generated.
- property uniquedict: Mapping[str, numpy.ndarray]
Dictionary of key names -> array(s) of unique categories.
GroupBy will pull values from the non-unique dictionary using iFirstKey. Categorical already holds a unique dictionary. Enums will pull with iFirstKey and return unique strings after translating integer codes.
- Returns:
Dictionary of key names -> array(s) of unique categories.
- Return type:
Notes
No sort is applied here.
- property uniquelist
See Grouping.uniquedict. Sets FastArray names as key names.
- DebugMode = False
- GroupingInit
- REGISTERED_REVERSE_TABLES = []
- __getitem__(fld)
Perform an indexing / slice operation on iKey, _catinstance, _grouping_dict if they have been set.
- Parameters:
fld – One of: a slice, an integer array (fancy index), or a boolean array (True/False mask). A single integer, a string, or a list of strings raises an error.
- Returns:
newgroup – A copy of the grouping object with a reindexed iKey (the dirty flag in the result will be set to True); or a single scalar value (for enum/singlekey grouping); or a tuple of scalar values (for multikey grouping).
- Return type:
Grouping
or scalar or tuple
- __repr__()
Return repr(self).
- _build_unique_dict(grouping)
Pull values from the non-unique grouping dict using the iFirstKey index. If enumstring is True, translate enum codes to their strings.
- _calculate_all(origdict, funcNum, func_param=0, keychain=None, user_args=(), tups=0, accum2=False, return_all=False, **kwargs)
All groupby calculations from GroupBy, Categorical, Accum2, and some groupbyops enter through this method.
- Parameters:
origdict –
funcNum –
func_param (int) – parameters from GroupByOps (often simple scalars)
keychain – Optional groupby keys to apply to the final dataset at the end.
user_args – A tuple of zero or more arguments to pass to the user function. user_args only exists for apply*-related function calls.
tups (int, default 0) – 1 if user functions had tuples () indicating to pass in all arrays. tups is only > 0 for apply*-related function calls where the first parameter was (arr1, ...).
accum2 (bool) –
return_all (bool) –
showfilter (bool, optional) – If set will calculate contents in the 0 bin.
See also
- _empty_allowed(funcNum)
Operations like cumcount do not need an origdict to calculate. Calculations are made only on binned columns. Might be more later, so keep here.
- _finalize_dataset(accumdict, keychain, gbkeys, transform=False, showfilter=False, addkeys=False, **kwargs)
Possibly transform? (TODO: move to here.) Possibly reattach keys. Possibly sort.
- _from_categories(grouping, categories, arr_len, base_index, filter, dtype, ordered, _trusted)
Initialize a Grouping object from pre-defined uniques.
- Parameters:
grouping (dict of single array) – Pre-defined iKey or non-unique values.
categories (dict of arrays) – Pre-defined dictionary of unique categories or enum mapping (not implemented)
arr_len (int) – Length of arrays in the categories dict.
filter (boolean array) – Pre-filter the same length as the non-unique values.
_trusted (bool) – If True, data will not be validated with min / max check.
- Returns:
ikey (ndarray of ints) – Base-0 or base-1 ikey.
ordered_flag (bool) – Flag indicating whether the categories were/are ordered. This is the ordered flag just being passed through.
- _get_calculate_dict(origdict, funcNum, func=None, return_all=False, computable=True, func_param=0, **kwargs)
Builds a dictionary to perform the groupby calculation on.
If string/string-like columns cannot be computed, they will not be included. If specific columns have been specified (in col_idx, see GroupBy.agg), only they will be included.
- Returns:
npdict (dict) – Final dictionary for calculation.
values (list) – List of columns in npdict. (NOTE: this is repetitive, as npdict has these values also.)
- static _hstack(glist, _trusted=False, base_index=1, ordered=False, destroy=False)
‘hstack’ operation for Grouping instances.
- Parameters:
_trusted (bool) – Indicates whether we need to validate the data in the supplied Grouping instances for consistency / correctness before using it. In certain cases, the caller knows the data is safe to use directly (e.g. because they’ve just created it), so the validation can be skipped.
base_index (int) – The base index to use for the resulting Categorical.
ordered (bool) – Indicates whether the resulting Categorical will be an ‘ordered’ Categorical (sometimes called an ‘Ordinal’).
destroy (bool) – This parameter is unused.
- Return type:
- _make_accum_dataset(origdict, npdict, accum, funcNum, return_all=False, keychain=None, **kwargs)
Returns a Dataset
- _make_enumikey(list_values, filter=None)
Internal routine to lazily generate the ikey for an enum. If a filter is passed on init, the ikey has to be generated upfront.
Also generates ifirstkey and unique_count.
- _make_isortrows(gbkeys)
Sort a single or multikey dictionary of unique values. Return the sorted index.
- _return_dataset(origdict, accumdict, func_num, return_all=False, col_idx=None, keychain=None, **kwargs)
- _set_anydict(d)
Replace the dict returned by _anydict Will check for and set _grouping_dict first.
- _set_newinstance(newinstance)
- apply(origdict, userfunc, *args, tups=0, filter=None, label_keys=None, return_all=False, **kwargs)
Grouping apply (for Categorical, groupby, accum2) Apply function userfunc group-wise and combine the results together. The userfunc will be called back per group. The order of the groups is either:
Order of first appearance (when coming from a hash)
Lexicographical order (when lex=True or a Categorical with ordered=True)
If a group from a categorical has no rows (an empty group), then a dataset with one row of invalids (as a place holder) will be used and the userfunc will be called.
The function passed to apply must take a Dataset as its first argument and return one of the following:
a Dataset (with one or more rows returned)
a dictionary of name:array pairs
a single array
The set of returned columns must be consistent for each input (group) dataset.
apply will then take care of combining the results back together into a Dataset with the groupby key(s) in the initial column(s). apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods. riptable offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.
- Parameters:
userfunc (callable) – A callable that takes a Dataset as its first argument, and returns a Dataset, dict, or single array. In addition the callable may take positional and keyword arguments.
args (tuple) – Optional positional arguments to pass to userfunc.
kwargs (dict) – Optional keyword arguments to pass to userfunc.
- Returns:
Two results are possible:
- a Dataset that is grouped by (reduced from the original dataset)
- a Dataset of original length (not grouped by)
Examples
>>> ds = rt.Dataset({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = rt.GroupBy(ds, 'A')
From ds above we can see that g has two groups, a and b. Calling apply in various ways, we can get different grouping results.
Example 1: below, the function passed to apply takes a Dataset as its argument and returns a Dataset or dictionary with one row for each row in each group. apply combines the result for each group together into a new Dataset:
>>> g.apply(lambda x: x.sum())
*A   B    C
--   -   --
a    3   10
b    3    5
>>> g.apply(lambda x: {'B': x.B.sum()})
*A   B
--   -
a    3
b    3
Example 2: The function passed to apply takes a Dataset as its argument and returns a Dataset with one row per group. apply combines the result for each group together into a new Dataset:
>>> g.apply(lambda x: x.max() - x.min())
*A   B   C
--   -   -
a    1   2
b    0   0
Example 3: The function passed to apply takes a Dataset as its argument and returns a Dataset with one row and one column per group (i.e., a scalar). apply combines the result for each group together into a Dataset:
>>> g.apply(lambda x: rt.Dataset({'val': [x.C.max() - x.B.min()]}))
*A   val
--   ---
a      5
b      2
Example 4: The function returns a Dataset with more than one row.
>>> g.apply(lambda x: x.cumsum())
*A   B    C
--   -   --
a    1    4
a    3   10
b    3    5
Example 5: A non-lambda, user-supplied function which creates a new column in the existing Dataset.
>>> def userfunc(x):
...     x.Sub = x.C - x.B
...     return x
>>> g.apply(userfunc)
*A   B   C   Sub
--   -   -   ---
a    1   4     3
a    2   6     4
b    3   5     2
- apply_helper(isreduce, origdict, userfunc, *args, tups=0, filter=None, showfilter=False, label_keys=None, func_param=None, dtype=None, badrows=None, badcols=None, computable=True, **kwargs)
Grouping apply_reduce/apply_nonreduce (for Categorical, groupby, accum2)
- For every column of data to be computed:
- The userfunc will be called back per group as a single array. The order of the groups is either:
Order of first appearance (when coming from a hash)
Lexicographical order (when lex=True or a Categorical with ordered=True)
A reduce function must take an array as its first argument and return back a single scalar value. A non-reduce function must take an array as its first argument and return back another array. The first argument to apply MUST be the callable user function.
The second argument to apply contains one or more arrays to operate on.
If passed as a list, the userfunc is called for each array in the list
If passed as a tuple, the userfunc is called once with all the arrays as parameters
- Parameters:
isreduce (bool) – Must be set. True for reduce, False for non-reduce.
origdict (dict of name:arrays) – The column names and arrays to apply the function on.
userfunc (callable) – A callable that takes one or more arrays as its first argument, and returns an array or scalar. If isreduce is True, userfunc is a reduction and should return a scalar; when isreduce is False, userfunc is a nonreduce/scan/prefix-sum and should return an array. In addition the callable may take positional and keyword arguments.
*args – Any additional user arguments to pass to userfunc.
tups (int, default 0) – Set to 1 if userfunc wants multiple arrays passed fixed up by iGroup. Set to 2 for passing in constants.
showfilter (bool) – Set to True to calculate the filter. Defaults to False.
filter (ndarray of bools) – optional boolean filter to apply
label_keys (rt.GroupByKeys, the labels on the left) –
func_param (tuple, optional) – The caller may pass func_param=(arg1, arg2) to pass arguments to userfunc.
dtype (str or np.dtype, or dict of np.dtypes, optional) – Explicitly specify the dtype for the output array. Defaults to None, which means the function chooses a compatible dtype for the output. If a dict of np.dtypes is passed, multiple output arrays are allocated based on the specified dtypes.
badrows – Not used; may be passed from Accum2.
badcols – Not used.
Notes
All other arguments passed to this function (if any remaining) will be passed through to
userfunc.
See also
GroupByOps.apply_reduce, GroupByOps.apply_nonreduce
- as_filter(index)
Returns an index filter for a given unique key
- copy(deep=True)
Create a shallow or deep copy of the grouping object.
- copy_from(other=None)
Initializes a new Grouping object if other is None. Otherwise, shallow-copies all necessary attributes from another grouping object to self.
- Parameters:
other (Grouping) –
- count(gbkeys=None, isortrows=None, keychain=None, filter=None, transform=False, **kwargs)
Compute the count of each unique key. Returns a dataset containing a single column. The Grouping object has the ability to generate this column on its own, and therefore skips straight to _return_dataset rather than passing through _calculate_all like other groupby calculations.
- static extract_groups(condition, grouped_data, ncountgroup, ifirstgroup)
Take groups of elements from an array, where the groups are selected by a boolean mask.
This function provides boolean-indexing over groups of data – so a boolean mask can be used to select _groups_ of data, rather than just individual elements, and the grouped elements will be copied to the output.
- Parameters:
condition (np.ndarray of bool) – An array whose nonzero or True entries indicate the groups in ncountgroup whose elements will be extracted from grouped_data.
grouped_data (np.ndarray) –
ncountgroup (np.ndarray of int) –
ifirstgroup (np.ndarray of int) –
- Return type:
np.ndarray
- Raises:
ValueError – When condition is not a boolean/logical array; when condition and ncountgroup have different shapes; when ncountgroup and ifirstgroup have different shapes.
See also
Examples
Select data from an array, where the elements belong to even-numbered groups within the Grouping object.
>>> key_data = rt.FA([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6])
>>> data = rt.arange(len(key_data))
>>> g = rt.Grouping(key_data)
>>> group_mask = rt.arange(len(g.ncountgroup)) % 2 == 0
>>> Grouping.extract_groups(group_mask, data, g.ncountgroup, g.ifirstgroup)
FastArray([ 1,  2,  6,  7,  8,  9, 15, 16, 17, 18, 19, 20])
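The same selection can be sketched in pure Python (a hypothetical equivalent, assuming grouped_data is already in packed iGroup order; the ncountgroup/ifirstgroup values below are what the example above would produce for base-1 bins):

```python
def extract_groups_sketch(condition, grouped_data, ncountgroup, ifirstgroup):
    """Pure-Python sketch of Grouping.extract_groups: copy out the
    elements of every group whose entry in condition is True."""
    out = []
    for grp, keep in enumerate(condition):
        if keep:
            start = ifirstgroup[grp]
            out.extend(grouped_data[start:start + ncountgroup[grp]])
    return out

# Six groups of sizes 1..6 (bin 0 empty), matching the example above.
ncountgroup = [0, 1, 2, 3, 4, 5, 6]
ifirstgroup = [0, 0, 1, 3, 6, 10, 15]
mask = [g % 2 == 0 for g in range(7)]
extract_groups_sketch(mask, list(range(21)), ncountgroup, ifirstgroup)
# → [1, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 20]
```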
- get_name()
List of grouping or grouping unique dict keys.
- isin(values)
Used to match values.
- Return type:
numpy array of bools where the values are found
See also
rt.Grouping.isin
- ismember(values, reverse=False)
Used to match against the unique categories. NOTE: This does not match against the entire array, just the uniques.
- Parameters:
reverse (bool, defaults to False) – Set to True to reverse ismember(A, B) to ismember(B, A).
- Returns:
member_mask (np.ndarray of bool) – boolean array of matches to unique categories
member_indices (np.ndarray of int) – fancy index array of location in unique categories
Examples
>>> a = rt.Cat(['b','c','d']).tile(5)
>>> b = rt.Cat(['a','b','d','e','f']).tile(5)
>>> tf1 = rt.ismember(a, b)[0]
>>> tf2 = b.grouping.ismember(a.categories())[1][b - 1] != -128
>>> np.all(tf1 == tf2)
True
>>> a = rt.Cat(['BABL','COKE','DELT']).tile(50000)
>>> b = rt.Cat(['AAPL','BABL','DELT','ECHO','FB']).tile(33333333)
>>> %time tf1 = rt.ismember(a, b)[0]
197 ms
>>> %time tf3 = rt.ismember(a.category_array, b.category_array)[1][a - 1] != -128
1 ms
>>> np.all(tf1 == tf3)
True
See also
rt.Grouping.isin, rt.ismember
- classmethod newclassfrominstance(instance, origin)
- newgroupfrominstance(newinstance)
calculate_all may change the instance
- Parameters:
newinstance (integer based array (codes or bins)) –
- Return type:
a new grouping object
- onedict(unicode=False, invalid=True, sep='_')
Concatenates multikey groupings with an underscore to make a single key. Adds 'Inv' to the first element if the kwarg invalid=True.
- Parameters:
unicode (boolean, default False) – whether to create a string or unicode based array
invalid (boolean, default True) – whether or not to add ‘Inv’ as the first unique
- Returns:
a string of the new key name
a new single array of the uniques concatenated
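A hypothetical sketch of that concatenation (the dict keys and values here are invented for illustration; riptable's actual separator handling and invalid insertion may differ in detail):

```python
def onedict_sketch(uniques, sep='_', invalid=True):
    """Sketch of onedict: concatenate multikey unique arrays with a
    separator into a single key array; optionally prepend 'Inv'."""
    # Combine row-wise across the unique arrays: ('AAPL', 'N') -> 'AAPL_N'
    combined = [sep.join(str(v) for v in row) for row in zip(*uniques.values())]
    if invalid:
        combined = ['Inv'] + combined   # placeholder for the invalid bin
    name = sep.join(uniques.keys())     # new single key name
    return name, combined

onedict_sketch({'sym': ['AAPL', 'MSFT'], 'ex': ['N', 'Q']})
# → ('sym_ex', ['Inv', 'AAPL_N', 'MSFT_Q'])
```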
- pack_by_group(filter=None, mustrepack=False)
Used to prepare data for custom functions.
Prepares 3 arrays:
iGroup: array size is same as the multikey; unique keys are grouped together
iFirstGroup: array size is the number of unique keys; indexes into iGroup (the sort)
nCountGroup: array size is the number of unique keys; the number of items in each group
The user should use igroup, ifirstgroup, ncountgroup.
If a filter is passed, it is remembered.
- classmethod possibly_recast(arr, unique_count, dtype=None)
unique_count is checked and a preferred (minimal) dtype size is calculated.
If a dtype has been provided, it will be used (only if it is large enough to fit the maximum value for the calculated dtype).
- Parameters:
- Returns:
new_arr – A recast array with a smaller dtype, the requested dtype, or possibly the same array as arr if no changes were needed.
- Return type:
ndarray of ints
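The recast rule can be sketched as follows (the size thresholds here are plain two's-complement maxima; riptable also reserves sentinel/invalid values, so its actual cutoffs may differ):

```python
# Signed integer maxima, smallest first (assumption: plain two's-complement).
_INT_MAX = [('int8', 2**7 - 1), ('int16', 2**15 - 1),
            ('int32', 2**31 - 1), ('int64', 2**63 - 1)]

def minimal_int_dtype(unique_count, requested=None):
    """Sketch of the dtype-minimization rule described above: pick the
    smallest signed int type that holds unique_count, unless the caller
    requested a dtype that is at least as large."""
    for name, maxval in _INT_MAX:
        if unique_count <= maxval:
            minimal_name, minimal_max = name, maxval
            break
    if requested is not None:
        req_max = dict(_INT_MAX)[requested]
        if req_max >= minimal_max:      # requested dtype is large enough
            return requested
    return minimal_name                  # fall back to the minimal dtype

minimal_int_dtype(200)            # → 'int16'
minimal_int_dtype(200, 'int32')   # → 'int32'
minimal_int_dtype(200, 'int8')    # → 'int16' (requested dtype too small)
```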
- classmethod register_functions(functable)
- regroup(filter=None, ikey=None)
Regenerate the groupings iKey, possibly with a filter and/or eliminating unique values.
- Parameters:
- Returns:
New Grouping object created by regenerating the ikey, ifirstkey, and unique_count using data from this instance.
- Return type:
- set_dirty()
If the shared information (like a Categorical’s instance array) has been changed outside of the grouping object, the changing routine can call this on the grouping object.
- set_name(name)
If the grouping dict contains a single item, rename it.
This will make categorical results consistent with groupby results if they’ve been constructed before being added to a dataset. Ensures that label names are consistent with categorical names.
- Parameters:
name (str) – The new name to use for the single column in the internal grouping dictionary.
Examples
Single key Categorical added to a Dataset, grouping picks up name:
>>> c = rt.Categorical(['a','a','b','c','a'])
>>> print(c.get_name())
None
>>> ds = rt.Dataset({'catcol': c})
>>> ds.catcol.sum(rt.arange(5))
*catcol   col_0
-------   -----
a             5
b             2
c             3
Multikey Categorical, no names:
>>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.FA([1,1,2,3,1])])
>>> print(c.get_name())
None
>>> ds = rt.Dataset({'mkcol': c})
>>> ds.mkcol.sum(rt.arange(5))
*mkcol_0   *mkcol_1   col_0
--------   --------   -----
a                 1       5
b                 2       2
c                 3       3
Multikey Categorical, already has names for its columns (names are preserved):
>>> arr1 = rt.FA(['a','a','b','c','a'])
>>> arr1.set_name('mystrings')
>>> arr2 = rt.FA([1,1,2,3,1])
>>> arr2.set_name('myints')
>>> c = rt.Categorical([arr1, arr2])
>>> ds = rt.Dataset({'mkcol': c})
>>> ds.mkcol.sum(rt.arange(5))
*mystrings   *myints   col_0
----------   -------   -----
a                  1       5
b                  2       2
c                  3       3
- shrink(newcats, misc=None, inplace=False, name=None)
- Parameters:
newcats (array_like) – New categories to replace the old - typically a reduced set of strings
misc (scalar, optional) – Value to use as category for items not found in new categories. This will be added to the new categories. If not provided, all items not found will be set to a filtered bin.
inplace (bool, not implemented) – If True, re-index the categorical’s underlying FastArray. Otherwise, return a new categorical with a new index and grouping object.
name –
- Returns:
A new Grouping object based on this instance's data and the new set of labels provided in newcats.
- Return type:
- sort(keylist)
- static take_groups(grouped_data, indices, ncountgroup, ifirstgroup)
Take groups of elements from an array.
This function provides fancy-indexing over groups of data – so a fancy index can be used to specify _groups_ of data, rather than just individual elements, and the grouped elements will be copied to the output.
- Parameters:
- Return type:
np.ndarray
- Raises:
ValueError – When ncountgroup and ifirstgroup have different shapes.
See also
Examples
Select data from an array, where the elements belong to the 2nd, 4th, and 6th groups within the Grouping object.
>>> key_data = rt.FA([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6])
>>> data = rt.arange(len(key_data))
>>> g = rt.Grouping(key_data)
>>> group_indices = rt.FA([2, 4, 6])
>>> Grouping.take_groups(data, group_indices, g.ncountgroup, g.ifirstgroup)
FastArray([ 1,  2,  6,  7,  8,  9, 15, 16, 17, 18, 19, 20])
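A pure-Python sketch of the same fancy-indexing over groups (assuming grouped_data is already in packed iGroup order; the ncountgroup/ifirstgroup values are what the example above would produce for base-1 bins):

```python
def take_groups_sketch(grouped_data, indices, ncountgroup, ifirstgroup):
    """Pure-Python sketch of Grouping.take_groups: fancy-index over
    groups, copying the elements of each requested group to the output."""
    out = []
    for grp in indices:
        start = ifirstgroup[grp]
        out.extend(grouped_data[start:start + ncountgroup[grp]])
    return out

# Same toy layout as extract_groups: six groups of sizes 1..6, bin 0 empty.
ncountgroup = [0, 1, 2, 3, 4, 5, 6]
ifirstgroup = [0, 0, 1, 3, 6, 10, 15]
take_groups_sketch(list(range(21)), [2, 4, 6], ncountgroup, ifirstgroup)
# → [1, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 20]
```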
- riptable.rt_grouping.combine2groups(group_row, group_col, filter=None, showfilter=False)
The group_row unique keys are used in the grouping_dict returned. The group_col unique keys are expected to become columns.
- Parameters:
- Returns:
A new Grouping object. The new ikey will always have (group_row.unique_count + 1) * (group_col.unique_count + 1) bins. The grouping_dict in the Grouping object will be for the rows only.
- Return type:
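The multiplied bin count suggests a row-major combination of the two ikeys. A hypothetical sketch (the exact layout riptable uses is not documented here; this only illustrates why the bin count multiplies):

```python
def combine_ikeys(row_ikey, col_ikey, col_unique_count):
    """Hypothetical sketch of combining a row ikey and a column ikey into
    one ikey with (row_unique+1)*(col_unique+1) bins, row-major order."""
    ncols = col_unique_count + 1          # +1 for the filtered/invalid bin 0
    return [r * ncols + c for r, c in zip(row_ikey, col_ikey)]

combine_ikeys([1, 1, 2], [1, 2, 1], col_unique_count=2)
# → [4, 5, 7]
```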
- riptable.rt_grouping.hstack_groupings(ikey, uniques, i_cutoffs=None, u_cutoffs=None, from_mapping=False, base_index=1, ordered=False, verbose=False)
For hstacking Categoricals or fixing indices in a categorical from a stacked .sds load. Supports Categoricals from a single array or dictionary mapping.
- Parameters:
- Returns:
list or array_like – list of fixed indices, or array of fixed contiguous indices.
list of ndarray – stacked unique values
- riptable.rt_grouping.hstack_test(arr_list)
- riptable.rt_grouping.merge_cats(indices, listcats, idx_cutoffs=None, unique_cutoffs=None, from_mapping=False, stack=True, base_index=1, ordered=False, verbose=False)
For hstacking Categoricals possibly from a stacked .sds load.
Supports Categoricals from single array or dictionary mapping.
- Parameters:
indices (single stacked array or list of indices) – if single array, needs idx_cutoffs for slicing
listcats (list of stacked unique category arrays (needs unique_cutoffs), or list of lists of uniques) – If the uniques in file1 are 'A','C' and the uniques in file2 are 'B','C','D', then listcats is [FastArray('A','C','B','C','D')].
idx_cutoffs (ndarray of int64, optional) – int64 array of the cutoffs into indices. If the index lengths are 30 and 20, idx_cutoffs is [30, 50].
unique_cutoffs (list of one int64 array of the cutoffs into listcats) – If the unique lengths are 2 and 3, unique_cutoffs is [2, 5].
from_mapping (bool) –
stack (bool) –
base_index (int) –
ordered (bool) –
verbose (bool) –
- Returns:
Two items:
- list of fixed indices, or array of fixed contiguous indices.
- stacked unique values
Notes
TODO: Needs to support multikey cats.
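A single-key sketch of the fix-up merge_cats describes, using the 'A','C' / 'B','C','D' example from the parameters above (pure Python, base-1 indices; the real routine works on FastArrays and cutoff arrays, and multikey support is still a TODO):

```python
def merge_uniques_sketch(index_lists, unique_lists, base_index=1):
    """Sketch of the merge: build one combined unique array (order of
    first appearance) and re-map each file's indices onto it."""
    merged, position = [], {}
    for uniques in unique_lists:
        for u in uniques:
            if u not in position:
                position[u] = len(merged) + base_index
                merged.append(u)
    fixed = []
    for idx, uniques in zip(index_lists, unique_lists):
        # map old per-file bin -> new combined bin; 0 stays the filtered bin
        remap = [0] + [position[u] for u in uniques]
        fixed.extend(remap[i] for i in idx)
    return fixed, merged

fixed, merged = merge_uniques_sketch(
    [[1, 2, 2], [1, 3, 2]],
    [['A', 'C'], ['B', 'C', 'D']])
# merged → ['A', 'C', 'B', 'D']
# fixed  → [1, 2, 2, 3, 4, 2]
```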