riptable.rt_categorical
Classes
A |
|
Holds categories for each Categorical instance. This adds a layer of abstraction to Categorical. |
Functions
|
Calls Categorical() with base_index keyword set to 0. |
|
|
|
Checks to make sure all unique string values in all dictionaries have the same corresponding integer in every categorical they appear in. |
- class riptable.rt_categorical.Categorical(values, categories=None, ordered=None, sort_gb=None, sort_display=None, lex=None, base_index=None, filter=None, dtype=None, unicode=None, invalid=None, auto_add=False, from_matlab=False, _from_categorical=None)[source]
Bases:
riptable.rt_groupbyops.GroupByOps
,riptable.rt_fastarray.FastArray
A Categorical efficiently stores an array of repeated strings and is used for groupby operations.
Riptable Categorical objects have two related uses:
They efficiently store string (or other large dtype) arrays that have repeated values. The repeated values are partitioned into groups (a.k.a. categories), and each group is mapped to an integer. The mapping codes allow the data to be stored and operated on more efficiently.
They’re Riptable’s class for doing groupby operations. A method applied to a Categorical is applied to each group separately.
A Categorical is typically created from a list of strings:
>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> c
Categorical([b, a, b, a, c, c, b]) Length: 7
  FastArray([2, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
The output shows:
The Categorical values. These are grouped into unique categories (here, “a”, “b”, and “c”), which are also stored in the Categorical (see below).
The integer mapping codes (also called bins). Each integer is mapped to a unique category (here, 1 is mapped to “a”, 2 is mapped to “b”, and 3 is mapped to “c”). Because these codes can also be used to index into the Categorical, they’re also referred to as indices. By default, the index is 1-based, with 0 reserved for Filtered values.
The unique categories. Each category represents a group for groupby operations.
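The relationship between values, mapping codes, and unique categories can be illustrated with a short pure-Python sketch. This is not Riptable's implementation: `encode` is a hypothetical helper that assumes sorted unique categories and base-1 codes, with 0 left free for Filtered values.

```python
def encode(values):
    """Return (codes, categories) using sorted categories and 1-based codes."""
    categories = sorted(set(values))
    # Start at 1 so that code 0 stays available for Filtered values.
    code_of = {cat: i for i, cat in enumerate(categories, start=1)}
    codes = [code_of[v] for v in values]
    return codes, categories


codes, categories = encode(["b", "a", "b", "a", "c", "c", "b"])
print(codes)       # [2, 1, 2, 1, 3, 3, 2]
print(categories)  # ['a', 'b', 'c']
```

Storing one small integer per element plus a single copy of each category is what makes repeated strings cheap to hold and compare.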
Use Categorical objects to perform aggregations over arbitrary arrays of the same dimension as the Categorical:
>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> ints = rt.FA([3, 10, 2, 5, 4, 1, 1])
>>> flts = rt.FA([1.2, 3.4, 5.6, 4.0, 2.1, 0.6, 11.3])
>>> c.sum([ints, flts])
*key_0   col_0   col_1
------   -----   -----
a           15    7.40
b            6   18.10
c            5    2.70

[3 rows x 3 columns] total bytes: 51.0 B
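Conceptually, a grouped sum walks the mapping codes once and accumulates each data value into the slot for its code. The following is a minimal pure-Python sketch of that idea, not Riptable's implementation; `group_sum` is a hypothetical helper, and the codes are written out by hand so the block stands alone.

```python
def group_sum(codes, n_categories, data):
    """Sum data per category; slot 0 collects Filtered values and is dropped."""
    totals = [0] * (n_categories + 1)
    for code, x in zip(codes, data):
        totals[code] += x
    return totals[1:]


codes = [2, 1, 2, 1, 3, 3, 2]   # mapping codes for ["b", "a", "b", "a", "c", "c", "b"]
ints = [3, 10, 2, 5, 4, 1, 1]
sums = group_sum(codes, 3, ints)
print(dict(zip(["a", "b", "c"], sums)))  # {'a': 15, 'b': 6, 'c': 5}
```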
Multi-Key Categoricals
The Categorical above is a single-key Categorical: it groups one array of values into keys (the categories) for groupby operations.
Multi-key Categorical objects let you create and operate on groupings based on multiple associated categories. The associated keys form a group:
>>> strs = rt.FastArray(["a", "b", "b", "a", "b", "a"])
>>> ints = rt.FastArray([2, 1, 1, 2, 1, 1])
>>> c = rt.Categorical([strs, ints])  # Create with a list of arrays.
>>> c
Categorical([(a, 2), (b, 1), (b, 1), (a, 2), (b, 1), (a, 1)]) Length: 6
  FastArray([1, 2, 2, 1, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'a', b'b', b'a'], dtype='|S1'), 'key_1': FastArray([2, 1, 1])} Unique count: 3
>>> c.count()
*key_0   *key_1   Count
------   ------   -----
a             2       2
b             1       3
a             1       1

[3 rows x 3 columns] total bytes: 27.0 B
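A multi-key grouping can be pictured as encoding tuples of associated keys instead of single values. The sketch below is pure Python and only illustrative; `encode_multikey` is a hypothetical helper, and numbering the key tuples in order of first appearance is an assumption chosen to match the output shown above.

```python
def encode_multikey(*key_arrays):
    """Assign a 1-based code to each distinct tuple of keys, in first-appearance order."""
    code_of = {}
    codes = []
    for key in zip(*key_arrays):
        if key not in code_of:
            code_of[key] = len(code_of) + 1
        codes.append(code_of[key])
    # Dicts preserve insertion order, so the keys are already in code order.
    return codes, list(code_of)


strs = ["a", "b", "b", "a", "b", "a"]
ints = [2, 1, 1, 2, 1, 1]
codes, unique_keys = encode_multikey(strs, ints)
print(codes)        # [1, 2, 2, 1, 2, 3]
print(unique_keys)  # [('a', 2), ('b', 1), ('a', 1)]
```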
Filtered Values and Categories
Filter values and categories to exclude them from operations on the Categorical.
Categorical objects can be filtered when they’re created or anytime afterwards. Because filtered items are mapped to 0 in the integer mapping array, filters can be used only in base-1 Categorical objects.
Filters can also be applied on a one-off basis at the time of an operation. See the Filtering topic under More About Categoricals for examples.
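The effect of a filter at creation time can be sketched in pure Python. This is an illustration rather than Riptable's code: `encode_filtered` is a hypothetical helper, and building the unique categories only from the unfiltered values is an assumption of the sketch.

```python
def encode_filtered(values, keep):
    """Encode values with 1-based codes; positions where keep is False get code 0."""
    categories = sorted({v for v, k in zip(values, keep) if k})
    code_of = {cat: i for i, cat in enumerate(categories, start=1)}
    codes = [code_of[v] if k else 0 for v, k in zip(values, keep)]
    return codes, categories


values = ["b", "a", "c", "b", "c"]
keep = [False, True, True, False, True]
codes, categories = encode_filtered(values, keep)
print(codes)       # [0, 1, 2, 0, 2]
print(categories)  # ['a', 'c']
```

Because 0 is the Filtered marker, it cannot double as a real category code, which is why filters require base-1 indexing.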
More About Categoricals
For more about using Categorical objects, see the Categoricals section of the Intro to Riptable or these more in-depth topics:
- Parameters:
values (array of str, int, or float, list of arrays, dict, or Categorical or pandas.Categorical) –
Strings: Unicode strings and byte strings are supported.
Integers without provided categories: The integer mapping codes start at 1.
Integers with provided categories: If you have an array of integers that indexes into an array of provided unique categories, the integers are used for the integer mapping array. Any 0 values are mapped to the Filtered category.
Floats are supported with no user-provided categories. If you have a Matlab Categorical with categories, set from_matlab to True. Categorical objects created from Matlab Categoricals must have a base-1 index; any 0.0 values become Filtered.
A list of arrays or a dictionary with multiple key-value pairs creates a multi-key Categorical.
For a Categorical created from a Categorical, a deep copy of categories is performed.
For a Categorical created from a Pandas Categorical, a deep copy is performed and indices start at 1 to preserve invalid values. Categorical objects created from Pandas Categoricals must have a base-1 index.
categories (array of str, int, or float, dict of {str : int} or {int : str}, or IntEnum, optional) –
The unique categories. Can be:
An array of strings, integers, or floats. Floats can be used only when values is numeric. Warning: Non-unique categories may give unexpected results in operations.
A dictionary or IntEnum that maps integers to strings or strings to integers. Provided values must be integers.
Note:
User-provided categories are always held in the order provided.
Multi-key Categorical objects don’t support user-provided categories.
ordered (bool, default None/True) –
Controls whether categories are sorted lexicographically before they are mapped to integers:
If categories are not provided, by default they are sorted. If ordered=False, the order is first appearance unless lex=True. To sort categories for groupby operations, use sort_gb=True (see below).
If categories are provided, they are always held in the order they’re provided in; they can’t be sorted with ordered or lex.
sort_gb (bool, default None/False) – Controls whether groupby operation results are displayed in sorted order. Note that results may already appear sorted based on ordered or lex settings.
sort_display (bool, optional) – See sort_gb.
lex (bool, default None/False) – Controls whether hashing- or sorting-based logic is used to find unique values in the input array. By default hashing is used. If more than 50% of the values are unique, set lex=True for a possibly faster lexicographical sort (not supported if categories are provided).
base_index ({None, 0, 1}, default None/1) –
By default, base-1 indexing is used. Base-0 can be used if:
A mapping dictionary isn’t used. A Categorical created from a mapping dictionary does not have a base index.
A filter isn’t used at creation.
A Matlab or Pandas Categorical isn’t being converted. These both reserve 0 for invalid values.
If base-0 indexing is used, 0 becomes a valid category.
filter (array of bool, optional) – Must be the same length as values. Values that are False become Filtered and mapped to 0 in the integer mapping array, and they are ignored in groupby operations. A filter can’t be used with a base-0 Categorical or one created with a mapping dictionary or IntEnum.
dtype (riptable.dtype, numpy.dtype, or str, optional) – Force the dtype of the underlying integer mapping array. Must be a signed integer dtype. By default, the constructor uses the smallest dtype based on the number of unique categories or the maximum value provided in a mapping.
unicode (bool, default False) – By default, the array of unique categories is stored as byte strings. Set to True to store as unicode strings.
invalid (str, optional) – Specify a value in values to be treated as an invalid category. Note: Invalid categories are not excluded from aggregations; use filter instead. Warning: If the invalid category isn’t included in categories and a filter is used, the invalid category becomes Filtered.
auto_add (bool, default False) – Warning: Until a known issue is fixed, adding categories can have unexpected results. Intended behavior: When set to True, categories that do not exist in the unique categories can be added using category_add.
from_matlab (bool, default False) – Set to True to convert a Matlab Categorical. The float indices are converted to an integer type. To preserve invalid values, only base-1 indexing is supported.
See also
Accum2
Class for multi-key aggregations with summary data displayed.
Categorical._fa
Return the array of integer category mapping codes that corresponds to the array of Categorical values.
Categorical.category_array
Return the array of unique categories of a Categorical.
Categorical.category_dict
Return a dictionary of the unique categories.
Categorical.category_mapping
Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.
Categorical.base_index
See the base index of a Categorical.
Categorical.isnan
See which Categorical category is invalid.
Examples
A single-key Categorical created from a list of strings:
>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> c
Categorical([b, a, b, a, c, c, b]) Length: 7
  FastArray([2, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
A Categorical created from a list of non-unique string values and a list of unique category strings. All values must appear in the provided categories, otherwise an error is raised:
>>> rt.Categorical(["b", "a", "b", "c", "a", "c", "c", "c"], categories=["b", "a", "c"])
Categorical([b, a, b, c, a, c, c, c]) Length: 8
  FastArray([1, 2, 1, 3, 2, 3, 3, 3], dtype=int8) Base Index: 1
  FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3
A Categorical created from a list of integers that index into a list of unique strings. The integers are used for the mapping array. Note that 0 becomes Filtered:
>>> rt.Categorical([0, 1, 1, 0, 2, 1, 2], categories=["c", "a", "b"])
Categorical([Filtered, c, c, Filtered, a, c, a]) Length: 7
  FastArray([0, 1, 1, 0, 2, 1, 2]) Base Index: 1
  FastArray([b'c', b'a', b'b'], dtype='|S1') Unique count: 3
If integers are provided with no categories and 0 is included, the integer mapping codes are incremented by 1 so that 0 is not Filtered:
>>> rt.Categorical([0, 1, 1, 0, 2, 1, 2])
Categorical([0, 1, 1, 0, 2, 1, 2]) Length: 7
  FastArray([1, 2, 2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([0, 1, 2]) Unique count: 3
Use from_matlab=True to create a Categorical from Matlab data. The float indices are converted to an integer type. To preserve invalid values, only base-1 indexing is supported:
>>> rt.Categorical([0.0, 1.0, 2.0, 3.0, 1.0, 1.0], categories=["b", "c", "a"], from_matlab=True)
Categorical([Filtered, b, c, a, b, b]) Length: 6
  FastArray([0, 1, 2, 3, 1, 1], dtype=int8) Base Index: 1
  FastArray([b'b', b'c', b'a'], dtype='|S1') Unique count: 3
A Categorical created from a Pandas Categorical with an invalid value:
>>> import pandas as pd
>>> pdc = pd.Categorical(["a", "a", "z", "b", "c"], ["c", "b", "a"])
>>> pdc
['a', 'a', NaN, 'b', 'c']
Categories (3, object): ['c', 'b', 'a']
>>> rt.Categorical(pdc)
Categorical([a, a, Filtered, b, c]) Length: 5
  FastArray([3, 3, 0, 2, 1], dtype=int8) Base Index: 1
  FastArray([b'c', b'b', b'a'], dtype='|S1') Unique count: 3
A Categorical created from a Python dictionary of strings to integers. The dictionary is provided as the categories argument, with a list of the mapping codes provided as the first argument:
>>> d = {"StronglyAgree": 44, "Agree": 133, "Disagree": 75, "StronglyDisagree": 1, "NeitherAgreeNorDisagree": 144 }
>>> codes = [1, 44, 44, 133, 75]
>>> rt.Categorical(codes, categories=d)
Categorical([StronglyDisagree, StronglyAgree, StronglyAgree, Agree, Disagree]) Length: 5
  FastArray([ 1, 44, 44, 133, 75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 4
A Categorical created using the categories of another Categorical:
>>> c = rt.Categorical(["a", "a", "b", "a", "c", "c", "b"], categories=["c", "b", "a"])
>>> c.category_array
FastArray([b'c', b'b', b'a'], dtype='|S1')
>>> c2 = rt.Categorical(["b", "c", "c", "b"], categories=c.category_array)
>>> c2
Categorical([b, c, c, b]) Length: 4
  FastArray([2, 1, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'c', b'b', b'a'], dtype='|S1') Unique count: 3
Multi-key Categoricals let you create and operate on groupings based on multiple associated categories:
>>> strs = rt.FastArray(["a", "b", "b", "a", "b", "a"])
>>> ints = rt.FastArray([2, 1, 1, 2, 1, 3])
>>> c = rt.Categorical([strs, ints])  # Create with a list of arrays.
>>> c
Categorical([(a, 2), (b, 1), (b, 1), (a, 2), (b, 1), (a, 3)]) Length: 6
  FastArray([1, 2, 2, 1, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'a', b'b', b'a'], dtype='|S1'), 'key_1': FastArray([2, 1, 3])} Unique count: 3
>>> c.count()
*key_0   *key_1   Count
------   ------   -----
a             2       2
b             1       3
a             3       1

[3 rows x 3 columns] total bytes: 27.0 B
- property _categories
- property _fa: riptable.rt_fastarray.FastArray
Return the array of integer category mapping codes that corresponds to the array of Categorical values.
- Returns:
A FastArray of the integer category mapping codes of the Categorical.
- Return type:
See also
Categorical.category_array
Return the array of unique categories of a Categorical.
Categorical.categories
Return the unique categories of a single-key or multi-key Categorical, prepended with the ‘Filtered’ category.
Categorical.category_dict
Return a dictionary of the unique categories.
Categorical.category_mapping
Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.
Examples
Single-key string
Categorical
:>>> c = rt.Categorical(['a','a','b','c','a']) >>> c Categorical([a, a, b, c, a]) Length: 5 FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c._fa FastArray([1, 1, 2, 3, 1], dtype=int8)
Multi-key
Categorical
:>>> c2 = rt.Categorical([rt.FA([1, 2, 3, 3, 3, 1]), rt.FA(['a','b','c','c','c','a'])]) >>> c2 Categorical([(1, a), (2, b), (3, c), (3, c), (3, c), (1, a)]) Length: 6 FastArray([1, 2, 3, 3, 3, 1], dtype=int8) Base Index: 1 {'key_0': FastArray([1, 2, 3]), 'key_1': FastArray([b'a', b'b', b'c'], dtype='|S1')} Unique count: 3 >>> c2._fa FastArray([1, 2, 3, 3, 3, 1], dtype=int8)
A
Categorical
constructed with anIntEnum
or a mapping dictionary returns the provided integer category mapping codes:>>> log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"} >>> c3 = rt.Categorical([10, 10, 40, 0, 50, 10, 30], log_levels) >>> c3 Categorical([DEBUG, DEBUG, ERROR, !<0>, CRITICAL, DEBUG, WARNING]) Length: 7 FastArray([10, 10, 40, 0, 50, 10, 30]) Base Index: None {10:'DEBUG', 20:'INFO', 30:'WARNING', 40:'ERROR', 50:'CRITICAL'} Unique count: 5 >>> c3._fa FastArray([10, 10, 40, 0, 50, 10, 30])
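With a mapping dictionary, the stored codes are the user's own integers, and the dictionary is consulted only when values are displayed. The sketch below is a pure-Python illustration, not Riptable's code; `decode` is a hypothetical helper, and rendering an unmapped code as `!<code>` simply mimics the display shown above.

```python
def decode(codes, mapping):
    """Look up each code in the mapping; unmapped codes are shown as !<code>."""
    return [mapping.get(code, f"!<{code}>") for code in codes]


log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"}
labels = decode([10, 10, 40, 0, 50, 10, 30], log_levels)
print(labels)
# ['DEBUG', 'DEBUG', 'ERROR', '!<0>', 'CRITICAL', 'DEBUG', 'WARNING']
```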
A ‘Filtered’ category is mapped to 0 in the integer array:
>>> c4 = rt.Categorical(['b','b','c','d','e','b','c']) >>> c4 Categorical([b, b, c, d, e, b, c]) Length: 7 FastArray([1, 1, 2, 3, 4, 1, 2], dtype=int8) Base Index: 1 FastArray([b'b', b'c', b'd', b'e'], dtype='|S1') Unique count: 4 >>> c4._fa FastArray([1, 1, 2, 3, 4, 1, 2], dtype=int8) >>> c4.category_remove('c') # A removed category becomes 'Filtered'. >>> c4 Categorical([b, b, Filtered, d, e, b, Filtered]) Length: 7 FastArray([1, 1, 0, 2, 3, 1, 0], dtype=int8) Base Index: 1 FastArray([b'b', b'c', b'd', b'e'], dtype='|S1') Unique count: 4 >>> c4._fa FastArray([1, 1, 0, 2, 3, 1, 0], dtype=int8)
- property _total_size: int
Returns total size in bytes of Categorical’s Index FastArray and category array(s).
- property as_string_array: riptable.rt_fastarray.FastArray
Return the full list of values of a Categorical as a string array.
For multi-key Categorical objects, the corresponding keys are concatenated with a “_” separator.
Filtered values become the string “Filtered”. Values from invalid categories are treated the same way as values from valid categories.
NOTE: This routine is costly because it re-expands the full list of values as strings.
- Returns:
A FastArray of the string values of the Categorical.
- Return type:
rt_fastarray.FastArray
See also
rt_categorical.Categorical.expand_array()
Return the full list of Categorical values.
Notes
This method works by applying an index mask to the unique categories.
Examples
Single-key string
Categorical
:>>> c = rt.Categorical(["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "TSLA", "AAPL"]) >>> c Categorical([AAPL, MSFT, AAPL, TSLA, MSFT, TSLA, AAPL]) Length: 7 FastArray([1, 2, 1, 3, 2, 3, 1], dtype=int8) Base Index: 1 FastArray([b'AAPL', b'MSFT', b'TSLA'], dtype='|S4') Unique count: 3 >>> c.as_string_array FastArray([b'AAPL', b'MSFT', b'AAPL', b'TSLA', b'MSFT', b'TSLA', b'AAPL'], dtype='|S8')
Single-key integer
Categorical
:>>> c = rt.Categorical([1, 2, 1, 1, 3, 2, 3]) >>> c.as_string_array FastArray([b'1', b'2', b'1', b'1', b'3', b'2', b'3'], dtype='|S21')
Multi-key
Categorical
:>>> key1 = rt.FastArray(["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "TSLA", "AAPL"]) >>> key2 = rt.FastArray([1, 1, 2, 2, 3, 3, 4]) >>> mk_cat = rt.Categorical([key1, key2]) >>> mk_cat Categorical([(AAPL, 1), (MSFT, 1), (AAPL, 2), (TSLA, 2), (MSFT, 3), (TSLA, 3), (AAPL, 4)]) Length: 7 FastArray([1, 2, 3, 4, 5, 6, 7], dtype=int8) Base Index: 1 {'key_0': FastArray([b'AAPL', b'MSFT', b'AAPL', b'TSLA', b'MSFT', b'TSLA', b'AAPL'], dtype='|S4'), 'key_1': FastArray([1, 1, 2, 2, 3, 3, 4])} Unique count: 7 >>> mk_cat.as_string_array FastArray([b'AAPL_1', b'MSFT_1', b'AAPL_2', b'TSLA_2', b'MSFT_3', b'TSLA_3', b'AAPL_4'], dtype='|S26')
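The multi-key concatenation can be sketched in a few lines of pure Python. This only illustrates the "_" joining rule described above; `as_strings` is a hypothetical helper, not a Riptable function, and it ignores Filtered values and byte-string dtypes.

```python
def as_strings(*key_arrays, sep="_"):
    """Join the corresponding keys of each row into one string per row."""
    return [sep.join(str(k) for k in row) for row in zip(*key_arrays)]


key1 = ["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "TSLA", "AAPL"]
key2 = [1, 1, 2, 2, 3, 3, 4]
joined = as_strings(key1, key2)
print(joined)
# ['AAPL_1', 'MSFT_1', 'AAPL_2', 'TSLA_2', 'MSFT_3', 'TSLA_3', 'AAPL_4']
```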
- property base_index: enum.IntEnum
- property category_array: riptable.rt_fastarray.FastArray
Return the array of unique categories of a Categorical.
Unlike Categorical.categories, this method does not prepend the ‘Filtered’ category to the returned array.
Raises an error for multi-key Categorical objects. To get the categories of a multi-key Categorical, use Categorical.categories.
- Returns:
A FastArray of the unique categories of the Categorical.
- Return type:
See also
Categorical._fa
Return the array of integer category mapping codes that corresponds to the array of Categorical values.
Categorical.categories
Return the unique categories of a single-key or multi-key Categorical, prepended with the ‘Filtered’ category.
Categorical.category_dict
Return a dictionary of the unique categories.
Categorical.category_mapping
Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.
Examples
Single-key string
Categorical
:>>> c = rt.Categorical(['a','a','b','c','a']) >>> c Categorical([a, a, b, c, a]) Length: 5 FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.category_array FastArray([b'a', b'b', b'c'], dtype='|S1')
Single-key integer
Categorical
:>>> c2 = rt.Categorical([4, 5, 4, 4, 6, 5, 6]) >>> c2 Categorical([4, 5, 4, 4, 6, 5, 6]) Length: 7 FastArray([1, 2, 1, 1, 3, 2, 3], dtype=int8) Base Index: 1 FastArray([4, 5, 6]) Unique count: 3 >>> c2.category_array FastArray([4, 5, 6])
Single-key integer
Categorical
with categories provided:>>> c3 = rt.Categorical([2, 3, 4, 2, 3, 4], categories=['a', 'b', 'c', 'd', 'e']) >>> c3 Categorical([b, c, d, b, c, d]) Length: 6 FastArray([2, 3, 4, 2, 3, 4]) Base Index: 1 FastArray([b'a', b'b', b'c', b'd', b'e'], dtype='|S1') Unique count: 5 >>> c3.category_array FastArray([b'a', b'b', b'c', b'd', b'e'], dtype='|S1')
The ‘Filtered’ category isn’t included:
>>> c4 = rt.Categorical([0, 1, 1, 0, 2, 1, 1, 1, 2, 0], categories=['a', 'b', 'c']) >>> c4 Categorical([Filtered, a, a, Filtered, b, a, a, a, b, Filtered]) Length: 10 FastArray([0, 1, 1, 0, 2, 1, 1, 1, 2, 0]) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c4.category_array FastArray([b'a', b'b', b'c'], dtype='|S1')
A
Categorical
constructed with anIntEnum
or a mapping dictionary returns the provided string categories:>>> log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"} >>> c5 = rt.Categorical([10, 10, 40, 0, 50, 10, 30], log_levels) >>> c5 Categorical([DEBUG, DEBUG, ERROR, !<0>, CRITICAL, DEBUG, WARNING]) Length: 7 FastArray([10, 10, 40, 0, 50, 10, 30]) Base Index: None {10:'DEBUG', 20:'INFO', 30:'WARNING', 40:'ERROR', 50:'CRITICAL'} Unique count: 5 >>> c5.category_array FastArray([b'DEBUG', b'INFO', b'WARNING', b'ERROR', b'CRITICAL'], dtype='|S8')
- property category_codes: riptable.rt_fastarray.FastArray
- property category_dict: Mapping[str, riptable.rt_fastarray.FastArray]
When possible, returns the dictionary of stored unique categories, otherwise raises an error.
Unlike the default for categories(), this will not prepend the invalid category to each array.
- property category_mode: riptable.rt_enum.CategoryMode
Returns the category mode of the Categorical’s Categories object. List modes are when the categorical has gone through the unique/mbget process of binning. Dict modes are when the categorical was constructed with a dictionary mapping or IntEnum. Grouping mode is when the categorical was binned with the groupby hash (numeric list, multikey, etc.)
- Returns:
see CategoryMode in rt_enum.py
- Return type:
IntEnum
- property expand_array: numpy.ndarray | Tuple[numpy.ndarray, Ellipsis]
Return the full list of values of a Categorical.
If the Categorical is constructed with an IntEnum or a mapping dictionary, the integer mapping codes are returned.
Filtered Categorical values are returned as “Filtered” for string arrays or numeric sentinel values for numeric arrays.
Note that because the expansion constructs the complete list of values from the list of unique categories, it is an expensive operation.
- Returns:
For single-key Categorical objects, a FastArray is returned. For multi-key Categorical objects, a tuple of FastArray objects is returned.
- Return type:
- Warns:
Performance warning – Will warn the user if a large Categorical (more than 100,000 items) is being re-expanded.
See also
Categorical.as_string_array
Return the full list of values of a Categorical as a string array.
Examples
Single-key
Categorical
:>>> c = rt.Categorical(["a", "a", "b", "c", "a"]) >>> c.expand_array FastArray([b'a', b'a', b'b', b'c', b'a'], dtype='|S3')
Multi-key
Categorical
:>>> c = rt.Categorical([rt.FastArray(["a", "b", "c", "a"]), rt.FastArray([1, 2, 3, 1])]) >>> c.expand_array (FastArray([b'a', b'b', b'c', b'a'], dtype='|S8'), FastArray([1, 2, 3, 1]))
For a
Categorical
constructed with anIntEnum
or a mapping dictionary, the array of integer mapping codes (c._fa
) is returned:>>> c = rt.Categorical([2, 2, 2, 1, 3], {"a": 1, "b": 2, "c": 3}) >>> c Categorical([b, b, b, a, c]) Length: 5 FastArray([2, 2, 2, 1, 3]) Base Index: None {1:'a', 2:'b', 3:'c'} Unique count: 3 >>> c.expand_array FastArray([2, 2, 2, 1, 3]) >>> c._fa FastArray([2, 2, 2, 1, 3])
Filtered string
Categorical
values are returned as the string “Filtered”:>>> a = rt.FastArray(["a", "c", "b", "b", "c", "a"]) >>> f = rt.FastArray([False, False, True, True, True, True]) >>> c = rt.Categorical(a, filter=f) >>> c Categorical([Filtered, Filtered, b, b, c, a]) Length: 6 FastArray([0, 0, 2, 2, 3, 1], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.expand_array FastArray([b'Filtered', b'Filtered', b'b', b'b', b'c', b'a'], dtype='|S8')
Filtered integer
Categorical
values are returned as the integer sentinel value:>>> a = rt.FastArray([1, 3, 2, 2, 3, 1]) >>> f = rt.FastArray([False, False, True, True, True, True]) >>> c = rt.Categorical(a, filter=f) >>> c Categorical([Filtered, Filtered, 2, 2, 3, 1]) Length: 6 FastArray([0, 0, 2, 2, 3, 1], dtype=int8) Base Index: 1 FastArray([1, 2, 3]) Unique count: 3 >>> c.expand_array FastArray([-2147483648, -2147483648, 2, 2, 3, 1])
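Expansion amounts to looking up every code in the unique categories, with code 0 mapped to a filler value. The sketch below is a pure-Python illustration under that assumption, not Riptable's implementation; `expand` and `INT32_SENTINEL` are hypothetical names, and the sentinel is the value displayed in the integer example above.

```python
INT32_SENTINEL = -2147483648  # filler shown for Filtered integers in the example above


def expand(codes, categories, filtered):
    """Rebuild the full list of values; code 0 becomes the `filtered` filler."""
    lookup = [filtered] + list(categories)
    return [lookup[code] for code in codes]


codes = [0, 0, 2, 2, 3, 1]
strings = expand(codes, ["a", "b", "c"], "Filtered")
ints = expand(codes, [1, 2, 3], INT32_SENTINEL)
print(strings)  # ['Filtered', 'Filtered', 'b', 'b', 'c', 'a']
print(ints)     # [-2147483648, -2147483648, 2, 2, 3, 1]
```

The result has one entry per element rather than one per category, which is why expansion is costly for large arrays.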
- property expand_dict: Dict[str, riptable.rt_fastarray.FastArray]
- Returns:
A dictionary of expanded single-key or multi-key columns.
- Return type:
dict
Notes
Will warn the user if a large categorical ( > 100,000 items ) is being re-expanded.
Examples
>>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.arange(5)])
>>> c.expand_dict
{'key_0': FastArray([b'a', b'a', b'b', b'c', b'a'], dtype='|S3'), 'key_1': FastArray([0, 1, 2, 3, 4])}
- property filtered_name: str
Item displayed when a 0 bin is encountered. Will be omitted from groupby results by default.
- property filtered_string
- property gb_keychain
- property groupby_data
All GroupByOps objects can hold a default dataset to perform operations on. GroupBy always holds a dataset. Categorical and Accum2 do not.
Examples
By default, requires data to be passed:
>>> c = rt.Categorical(['a','b','c']) >>> c.sum() ValueError: Useable data has not been specified in (). Pass in array data to operate on.
After the result of a Dataset.cat() operation, groupby data is set.
>>> ds = rt.Dataset({'groups':np.random.choice(['a','b','c'],10), 'data': rt.arange(10), 'data2': rt.arange(10)}) >>> ds # groups data data2 - ------ ---- ----- 0 a 0 0 1 a 1 1 2 c 2 2 3 c 3 3 4 a 4 4 5 a 5 5 6 c 6 6 7 b 7 7 8 c 8 8 9 a 9 9 >>> c = ds.cat('groups') >>> c.sum() *groups data data2 ------- ---- ----- a 19 19 b 7 7 c 19 19
- property grouping
Grouping object that is called to perform calculations on grouped data. In the constructor, a grouping object provides a categorical with its instance array. The grouping object stores and generates other groupby information, like grouping indices, first occurrence, count, etc. The grouping object should be queried for all grouping-related properties. This is also a property in GroupBy, and is called by many routines in the GroupByOps parent class.
See Also: Grouping
- property grouping_dict
Grouping dict held by Grouping object. May trigger lazy build of Grouping object.
- property ifirstkey
Index of first occurrence of each unique key. May also trigger lazy evaluation of grouping object. If grouping object used the Groupby hash, it will have an iFirstKey array, otherwise returns None.
- property ikey
Returns the grouping object’s iKey. This will always be a 1-base index, and is often the same array as the Categorical. See also: grouping.ikey (may return base 0 index)
- property ilastkey
Index of last occurrence of each unique key. May also trigger lazy evaluation of grouping object. If grouping object used the Groupby hash, it will have an iLastKey array, otherwise returns None.
- property invalid_category
The Categorical object’s invalid category.
An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.
- Returns:
The invalid category of the Categorical. Returns None if there’s no invalid category.
- Return type:
See also
Categorical.filtered_name
Item displayed when a 0 bin is encountered in a Categorical.
Categorical.isnan
Find the invalid elements of a Categorical.
Categorical.isnotnan
Find the valid elements of a Categorical.
Examples
>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b") >>> c Categorical([b, a, c, b, c]) Length: 5 FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.invalid_category 'b' >>> c.isnan() # Returns True for invalid category. FastArray([ True, False, False, True, False])
Invalid categories are different from Filtered categories:
>>> f = rt.FA([False, True, True, False, True]) >>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="a", filter=f) >>> c2 Categorical([Filtered, a, c, Filtered, c]) Length: 5 FastArray([0, 1, 2, 0, 2], dtype=int8) Base Index: 1 FastArray([b'a', b'c'], dtype='|S1') Unique count: 2 >>> c2.invalid_category 'a' >>> c2.isnan() # Show which values are in the invalid category. FastArray([False, True, False, False, False]) >>> c2.isfiltered() # Show which values are Filtered. FastArray([ True, False, False, True, False])
Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is False for Cat.isnan():
>>> a = rt.FA([1, 2, 3, 4])
>>> a[3] = a.inv  # Set the last value to an integer NaN.
>>> a
FastArray([ 1, 2, 3, -2147483648])
>>> c3 = rt.Categorical(values=a, invalid=2)  # Make 2 an invalid category.
>>> c3
Categorical([1, 2, 3, -2147483648]) Length: 4
  FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1
  FastArray([-2147483648, 1, 2, 3]) Unique count: 4
>>> c3.invalid_category  # A property, so it is accessed without parentheses.
2
>>> c3.isnan()  # Only the invalid category returns True for Cat.isnan.
FastArray([False, True, False, False])
>>> c3.expand_array.isnan()  # Only the integer NaN returns True for FA.isnan.
FastArray([False, False, False, True])
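The distinction between an invalid category and Filtered values comes down to which code is being tested. The following is a pure-Python sketch of that distinction, not Riptable's implementation; `isnan_mask` and `isfiltered_mask` are hypothetical helpers operating on hand-written base-1 codes.

```python
def isnan_mask(codes, categories, invalid):
    """True where the element belongs to the invalid category."""
    invalid_code = categories.index(invalid) + 1 if invalid in categories else None
    return [code == invalid_code for code in codes]


def isfiltered_mask(codes):
    """True where the element is Filtered (code 0)."""
    return [code == 0 for code in codes]


codes = [0, 1, 2, 0, 2]      # Filtered, a, c, Filtered, c
categories = ["a", "c"]
nan_mask = isnan_mask(codes, categories, "a")
filtered_mask = isfiltered_mask(codes)
print(nan_mask)       # [False, True, False, False, False]
print(filtered_mask)  # [True, False, False, True, False]
```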
- property ordered: bool
If the categorical is tagged as ordered, the unique categories will remain in the order they were provided in.
ordered
is also true if a sort was performed when generating the unique categories.
- property sorted: bool
If the categorical is tagged as sorted, it can use a binary search when performing a lookup in the unique categories.
If a sorted groupby operation is performed, no sort will need to be applied.
- property transform
TO BE DEPRECATED
Examples
>>> c = rt.Categorical(ds.symbol) >>> c.transform.sum(ds.TradeSize)
- property unique_count
Number of unique values in the categorical. It is necessary for every groupby operation.
Notes
For categoricals in dict / enum mode that have generated their grouping object, this will reflect the number of unique values that
occur
in the non-unique values. Empty bins will not be included in the count.
- property unique_repr
- DebugMode = False
- GroupingDebugMode = False
- MetaDefault
- MetaVersion = 1
- TestIsMemberVerbose = False
- _test_cat_ismember = ''
- __arrow_array__(type=None)[source]
Implementation of the
__arrow_array__
protocol for conversion to a pyarrow array.- Parameters:
type (pyarrow.DataType, optional, defaults to None) –
- Return type:
Notes
- __getitem__(fld)[source]
Indexing: Bracket indexing for Categoricals will always hit the FastArray of indices/codes first. If indexed by integer, the retrieved index or code will be passed to the Categories object so the corresponding Category can be returned. Otherwise, a new Categorical will be returned, using the same Categories as the original Categorical with a different index/code array.
The following examples will use this Categorical:
>>> c = rt.Categorical(['a','a','a','b','c','a','b']) >>> c Categorical([a, a, a, b, c, a, b]) Length: 7 FastArray([1, 1, 1, 2, 3, 1, 2], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
Single Integer:
For convenience, any bytestrings will be returned/displayed as unicode strings.
>>> c[3] 'b'
Multiple Integers:
>>> c[[1,2,3,4]] Categorical([a, a, b, c]) Length: 4 FastArray([1, 1, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c[np.arange(4,6)] Categorical([c, a]) Length: 2 FastArray([3, 1], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
Boolean Array:
>>> mask = rt.FastArray([False, True, True, True, True, True, False])
>>> c[mask]
Categorical([a, a, b, c, a]) Length: 5
  FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
Slice:
>>> c[2:5] Categorical([a, b, c]) Length: 3 FastArray([1, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
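The indexing rules above (a single integer returns a category, anything else returns a new object that shares the same categories) can be mimicked with a toy class. This is a pure-Python sketch for illustration only; `TinyCat` is hypothetical and covers just base-1 codes with integer, list, boolean-mask, and slice indexing.

```python
class TinyCat:
    """Toy stand-in for a categorical: base-1 codes plus shared categories."""

    def __init__(self, codes, categories):
        self.codes = list(codes)
        self.categories = categories

    def __getitem__(self, idx):
        if isinstance(idx, int):
            # Single integer: resolve the code to its category.
            code = self.codes[idx]
            return "Filtered" if code == 0 else self.categories[code - 1]
        if isinstance(idx, slice):
            new_codes = self.codes[idx]
        else:
            idx = list(idx)
            if idx and all(isinstance(i, bool) for i in idx):
                # Boolean mask: keep the codes where the mask is True.
                new_codes = [c for c, keep in zip(self.codes, idx) if keep]
            else:
                # List of integers: gather the codes at those positions.
                new_codes = [self.codes[i] for i in idx]
        # The categories are reused; only the code array changes.
        return TinyCat(new_codes, self.categories)


c = TinyCat([1, 1, 1, 2, 3, 1, 2], ["a", "b", "c"])
print(c[3])                   # b
print(c[[1, 2, 3, 4]].codes)  # [1, 1, 2, 3]
print(c[2:5].codes)           # [1, 2, 3]
mask = [False, True, True, True, True, True, False]
print(c[mask].codes)          # [1, 1, 2, 3, 1]
```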
- __setitem2__(key, value)[source]
Use grouping object isin, single item accessor instead of Categories object.
- __setitem__(index, value)[source]
- Parameters:
index (int or string (depends on category mode)) –
value (sequence or scalar value) – The value may represent a category or category index.
- Raises:
- _as_meta_data(name=None)[source]
- Parameters:
name (string, optional) – If not specified, will attempt to get name with get_name(), otherwise use class name.
- Returns:
arrdict (dictionary) – Dictionary of column names -> arrays. Extra columns (for unique categories) will have the name+’!’ before their keys.
arrtypes (list) – List of SDSFlags, same length as arrdict.
meta (json-encoded string) – Meta data for the categorical.
See also
- _build_sds_meta_data(name, **kwargs)[source]
Generates meta data from calling categorical, assembles arrays to represent its unique categories.
- Parameters:
name (name of the categorical in the calling structure, or Categorical by default) –
- Returns:
meta (MetaData) – Metadata object for final save
cols (list of FastArray) – arrays to represent unique categories - regardless of CategoryMode
tups (tuples with names of addtl. cols - still determining enum for second item in tuple (will relate to multiday load/concatenation)) – names will be in the format ‘name!col_’ followed by column number
- _categorical_compare_check(func_name, other)[source]
Converts a category to a valid index for faster logical comparison operations on the underlying index fastarray.
- _category_make_unique_multi_key()[source]
Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.
- _expand_array(arr, index=None)[source]
Internal routine to h-stack an invalid with an array for re-expanding single or multikey categoricals. This allows invalids to be retained in the re-expanded array(s)
- static _from_arrow(arr, zero_copy_only=True, writable=False)[source]
Create a Categorical instance from a dictionary-encoded pyarrow.Array.
For certain special cases, namely CategoryMode.IntEnum, CategoryMode.Dictionary, and CategoryMode.MultiKey, this method accepts an instance of pyarrow.Table, since Categorical instances with these CategoryModes don't have an encoding in pyarrow that'd directly preserve their structure. (For example, the direct mapping between the case labels and values for a CategoryMode.IntEnum or CategoryMode.Dictionary-mode Categorical.)
- Parameters:
arr (pyarrow.Array or pyarrow.ChunkedArray) – Must be a dictionary-encoded pyarrow array or a Struct-type array (e.g. pyarrow.StructArray).
zero_copy_only (bool, optional, defaults to True) –
writable (bool, optional, defaults to False) –
- Return type:
- classmethod _from_maybe_non_unique_labels(values, categories, base_index=1)[source]
Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.
- _getsingleitem(fld)[source]
If the getitem indexing operation returned a scalar, translate it according to how the uniques are being held.
- Return type:
Scalar or tuple based on unique type.
- _ipython_key_completions_()[source]
For tab completions with bracket indexing (__getitem__). The IPython completer needs a Python list or dict keys/values. If there is nothing to return (e.g. multikey categorical), an empty list is returned. An empty list is also returned if the categorical has more than 10_000 unique values. If an IPython environment is detected, the ‘greedy’ property is set to True in riptable’s __init__.
- classmethod _load_from_sds_meta_data(name, arr, cols, meta)[source]
Builds a categorical object from metadata and arrays.
Will translate metadata, array/column layout from older versions to be compatible with current loader. Raises an error if the metadata version is higher than the class’s meta version (user will need to update riptable)
- Parameters:
name (item's name in the calling container, or the classname Categorical by default) –
arr (the underlying index array for the categorical) –
cols (additional arrays to rebuild unique categories) –
meta (meta data generated by build_sds_meta_data() routine) –
- Returns:
Reconstructed categorical object.
- Return type:
Examples
>>> m = y._build_sds_meta_data('y') >>> rt.Categorical._load_from_sds_meta_data('y', y._fa, m[1], m[0])
- _prepend_invalid(arr)[source]
For base index 1 categoricals, add the invalid category to the beginning of the array of unique categories.
- Parameters:
arr (FastArray) – The array holding the unique category values for this Categorical. This array may be a FastArray or a subclass of FastArray.
- Returns:
An array of the same type as arr whose length is len(arr) + 1, where the first (0th) element of the array is the invalid value for that array type.
- Return type:
- static _transformed_scalar_compiled_numba_apply(iGroup, iFirstGroup, nCountGroup, userfunc, args)[source]
- classmethod align(cats)[source]
cats must be a list of categoricals. Their unique categories are merged into a new unique list, and the indices are remapped to point into the new category array.
- Return type:
A list of (possibly) new categoricals which share the same categories (and thus bin numbering).
Examples
>>> c1 = rt.Categorical(['a','b','c']) >>> c2 = rt.Categorical(['d','e','f']) >>> c3 = rt.Categorical(['c','f','z']) >>> rt.Categorical.align([c1,c2,c3]) [Categorical([a, b, c]) Length: 3 FastArray([1, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7 Categorical([d, e, f]) Length: 3 FastArray([4, 5, 6], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7 Categorical([c, f, z]) Length: 3 FastArray([3, 6, 7], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7]
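A rough pure-Python sketch of the merge-and-remap idea follows. It assumes each categorical is a (codes, categories) pair with 1-based codes and that the merged categories are sorted; align_codes is a hypothetical helper, not part of riptable.

```python
def align_codes(cats):
    """cats: list of (codes, categories) pairs with 1-based codes."""
    # Merge every category list into one sorted, unique list.
    merged = sorted({v for _, categories in cats for v in categories})
    position = {v: i + 1 for i, v in enumerate(merged)}
    aligned = []
    for codes, categories in cats:
        # Translate each old code to its position in the merged list; 0 stays 0.
        new_codes = [position[categories[c - 1]] if c > 0 else 0 for c in codes]
        aligned.append((new_codes, merged))
    return aligned


c1 = ([1, 2, 3], ["a", "b", "c"])
c2 = ([1, 2, 3], ["d", "e", "f"])
c3 = ([1, 2, 3], ["c", "f", "z"])
aligned = align_codes([c1, c2, c3])
```

With the inputs from the example above, the remapped codes come out as [1, 2, 3], [4, 5, 6], and [3, 6, 7] against the shared seven-element category list.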
- apply(userfunc=None, *args, dataset=None, **kwargs)[source]
See Grouping.apply for examples. Categorical needs to remove unused bins from its uniques before an apply.
- apply_nonreduce(userfunc=None, *args, dataset=None, **kwargs)[source]
See GroupByOps.apply_nonreduce for examples. Categorical needs to remove unused bins from its uniques before an apply.
- as_singlekey(ordered=False, sep='_')[source]
Normalizes categoricals by returning a base 1 single key categorical.
Enum- or dict-based categoricals are converted to single-key categoricals, as are multikey categoricals. A categorical that is already single key, base 0 is returned as base 1. A categorical that is already single key, base 1 is returned as is.
- Parameters:
ordered (bool, defaults False) – whether or not to sort the result
sep (char, defaults ='_') – only valid for multikey since this is the multikey separator
Examples
>>> c=rt.Cat([5, -3, 7], {-3:'one', 2:'two', 5: 'three', 7:'four'}) >>> d=c.as_singlekey() >>> c._fa FastArray([ 5, -3, 7])
>>> d._fa FastArray([3, 2, 1], dtype=int8)
- Return type:
A single key base 1 categorical.
- auto_add_off()[source]
Sets the _auto_add_categories flag to False. Category assignment with a non-existing category will raise an error.
Examples
>>> c = rt.Categorical(['a','a','b','c','a'], auto_add=True) >>> c._categories FastArray([b'a', b'b', b'c'], dtype='|S1') >>> c.auto_add_off() >>> c[0] = 'z' ValueError: Cannot automatically add categories [b'z'] while auto_add_categories is set to False.
- auto_add_on()[source]
If the categorical is unlocked, this sets the _auto_add_categories flag to be True. If _auto_add_categories is set to False, the following assignment will raise an error. If the categorical is locked, auto_add_on() will warn the user and the flag will not change.
Examples
>>> c = rt.Categorical(['a','a','b','c','a']) >>> c._categories FastArray([b'a', b'b', b'c'], dtype='|S1') >>> c.auto_add_on() >>> c[0] = 'z' >>> print(c) z, a, b, c, a >>> c._categories FastArray([b'a', b'b', b'c', b'z'], dtype='|S1')
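The effect of the auto-add flag on assignment can be illustrated with a small stand-in class. Everything here is hypothetical (AutoAddCat, its attributes, the error text) and assumes 1-based codes with new categories appended at the end, as in the example above; it does not model locking.

```python
class AutoAddCat:
    """Toy base-1 categorical with an auto-add flag (hypothetical names)."""

    def __init__(self, values, auto_add=False):
        self.categories = sorted(set(values))
        self.codes = [self.categories.index(v) + 1 for v in values]
        self.auto_add = auto_add

    def __setitem__(self, index, value):
        if value not in self.categories:
            if not self.auto_add:
                raise ValueError(f"Cannot automatically add category {value!r}")
            # Append the new category; existing codes stay valid.
            self.categories.append(value)
        self.codes[index] = self.categories.index(value) + 1

    def values(self):
        return [self.categories[c - 1] for c in self.codes]


c = AutoAddCat(["a", "a", "b", "c", "a"], auto_add=True)
c[0] = "z"                      # unknown category is added
after_add = c.values()

c.auto_add = False
try:
    c[1] = "q"                  # unknown category is rejected
    raised = False
except ValueError:
    raised = True
```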
- categories(showfilter=True)[source]
If the categories are stored in a single array or single-key dictionary, an array will be returned. If the categories are stored in a multikey dictionary, a dictionary will be returned. If the categories are a mapping, a dictionary of the mapping will be returned (int -> string)
Note: you can also request categories in a certain format when possible using properties:
category_array, category_dict, category_mapping.
- Parameters:
showfilter (bool, defaults to True) – If True (default), the invalid category will be prepended to the returned array or multikey columns. Does not apply when mapping is returned.
- Return type:
np.ndarray or dict
Examples
>>> c = rt.Categorical(['a','a','b','c','d']) >>> c.categories() FastArray([b'Inv', b'a', b'b', b'c', b'd'], dtype='|S3')
>>> c = rt.Categorical([rt.arange(3), rt.FA(['a','b','c'])]) >>> c.categories() {'key_0': FastArray([-2147483648, 0, 1, 2]), 'key_1': FastArray([b'Inv', b'a', b'b', b'c'], dtype='|S3')}
>>> c = rt.Categorical(rt.arange(3), {'a':0, 'b':1, 'c':2}) >>> c.categories() {0: 'a', 1: 'b', 2: 'c'}
- classmethod categories_equal(cats)[source]
Check if every Categorical or array has the same categories (same unique values in the same order).
- Parameters:
cats (list of Categorical or np.ndarray or tuple of np.ndarray) – cats must be a list of Categorical objects or arrays that can be converted to Categorical objects.
- Returns:
match (bool) – True if every Categorical has the same categories (same unique values in same order), otherwise False.
fixed_cats (list of Categorical) – List of Categorical objects which may have been fixed up.
Notes
TODO: Can the type annotation for cats be relaxed to Collection instead of List?
- category_make_unique()[source]
Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.
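As a minimal sketch of what "replacing categories with the unique set and remapping codes" could look like, assuming 1-based codes and first-seen ordering of the unique categories (the ordering riptable actually uses is not stated here); make_unique is a hypothetical helper:

```python
def make_unique(codes, categories):
    """Return (codes, categories) with duplicate categories collapsed."""
    if len(set(categories)) == len(categories):
        return list(codes), list(categories)   # already unique: get out early
    # Keep the first occurrence of each category, preserving order.
    unique = list(dict.fromkeys(categories))
    position = {v: i + 1 for i, v in enumerate(unique)}
    # Point every old code at the surviving copy of its category; 0 stays 0.
    new_codes = [position[categories[c - 1]] if c > 0 else 0 for c in codes]
    return new_codes, unique


new_codes, new_categories = make_unique([1, 2, 3, 4, 0], ["a", "b", "a", "c"])
same_codes, same_categories = make_unique([2, 1], ["x", "y"])
```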
- category_remove(value)[source]
Performance may suffer as indices need to be fixed up. All previous matches to the removed category will be flipped to invalid.
- copy(categories=None, ordered=None, sort_gb=None, lex=None, base_index=None, filter=None, dtype=None, unicode=None, invalid=None, auto_add=False, from_matlab=False, _from_categorical=None, deep=True, order='K')[source]
Return a copy of the input FastArray.
- Parameters:
order ({'K', 'C', 'F', 'A'}, default 'K') – Controls the memory layout of the copy: ‘K’ means match the layout of the input array as closely as possible; ‘C’ means row-based (C-style) order; ‘F’ means column-based (Fortran-style) order; ‘A’ means ‘F’ if the input array is formatted as ‘F’, ‘C’ if not.
- Returns:
A copy of the input FastArray.
- Return type:
See also
Categorical.copy
Return a copy of the input Categorical.
Dataset.copy
Return a copy of the input Dataset.
Struct.copy
Return a copy of the input Struct.
Examples
Copy a FastArray:
>>> a = rt.FA([1, 2, 3, 4, 5]) >>> a FastArray([1, 2, 3, 4, 5]) >>> a2 = a.copy() >>> a2 FastArray([1, 2, 3, 4, 5]) >>> a2 is a False # The copy is a separate object.
- copy_invalid()[source]
Return a copy of a FastArray filled with the invalid value for the array’s data type.
- Returns:
A copy of the input array, filled with the invalid value for the array’s dtype.
- Return type:
See also
FastArray.inv
Return the invalid value for the input array’s dtype.
FastArray.fill_invalid
Replace the values of a FastArray with the invalid value for the array’s dtype.
Examples
Copy an integer array and replace with invalids:
>>> a = rt.FA([1, 2, 3, 4, 5]) >>> a FastArray([1, 2, 3, 4, 5]) >>> a2 = a.copy_invalid() >>> a2 FastArray([-2147483648, -2147483648, -2147483648, -2147483648, -2147483648]) >>> a FastArray([1, 2, 3, 4, 5]) # a is unchanged.
Copy a floating-point array and replace with invalids:
>>> a3 = rt.FA([0., 1., 2., 3., 4.]) >>> a3 FastArray([0., 1., 2., 3., 4.]) >>> a3.copy_invalid() FastArray([nan, nan, nan, nan, nan])
Copy a string array and replace with invalids:
>>> a4 = rt.FA(['AMZN', 'IBM', 'MSFT', 'AAPL']) >>> a4 FastArray([b'AMZN', b'IBM', b'MSFT', b'AAPL'], dtype='|S4') >>> a4.copy_invalid() FastArray([b'', b'', b'', b''], dtype='|S4') # Invalid string value is an empty string.
- count(filter=None, transform=False)[source]
Count the number of times each value appears in a Categorical.
Unlike other Categorical operations, this does not take a parameter for data.
- Parameters:
filter (array of bool, optional) – Categorical values that correspond to False filter values are excluded from the count. The filter array must be the same length as the Categorical.
transform (bool, default False) – Set to True to return a Dataset that’s the length of the Categorical, with counts aligned to the ungrouped Categorical values. Only the counts are included.
- Returns:
A Dataset containing each unique category and its count. If transform is True, the Dataset is the same length as the original Categorical and contains only the counts.
- Return type:
See also
rt_grouping.Grouping.count()
Called by this method.
rt_categorical.Categorical.unique_count()
Return the number of unique values in a Categorical.
rt_fastarray.FastArray.count()
Return the unique values of a FastArray and their counts.
Examples
Create a Categorical and count its values:
>>> c = rt.Categorical(["a", "a", "b", "c", "a", "c"]) >>> c Categorical([a, a, b, c, a, c]) Length: 6 FastArray([1, 1, 2, 3, 1, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.count() *key_0 Count ------ ----- a 3 b 1 c 2 [3 rows x 2 columns] total bytes: 15.0 B
Filter based on Categorical values:
>>> f = (c == "a") >>> c.count(filter=f) *key_0 Count ------ ----- a 3 b 0 c 0 [3 rows x 2 columns] total bytes: 15.0 B
Filter based on a separate array of values:
>>> vals = rt.arange(6) >>> f = (vals > 2) >>> c.count(filter=f) *key_0 Count ------ ----- a 1 b 0 c 2 [3 rows x 2 columns] total bytes: 15.0 B
With transform=True, a Dataset is returned with counts aligned to the ungrouped Categorical values:
>>> c.count(transform=True) # Count - ----- 0 3 1 3 2 1 3 2 4 3 5 2 [6 rows x 1 columns] total bytes: 24.0 B
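The counting behavior shown in these examples can be approximated over a plain list of 1-based codes. This is only a sketch under that assumption; count_codes is a hypothetical helper that returns a dict or list rather than a Dataset.

```python
def count_codes(codes, categories, filter=None, transform=False):
    """Count occurrences of each category among 1-based codes."""
    counts = [0] * len(categories)
    for i, code in enumerate(codes):
        # Code 0 (Filtered) and rows masked out by the filter are skipped.
        if code > 0 and (filter is None or filter[i]):
            counts[code - 1] += 1
    if transform:
        # One count per original row, aligned to the ungrouped values.
        return [counts[code - 1] if code > 0 else 0 for code in codes]
    return dict(zip(categories, counts))


codes = [1, 1, 2, 3, 1, 3]
categories = ["a", "b", "c"]
plain = count_codes(codes, categories)
only_a = count_codes(codes, categories, filter=[c == 1 for c in codes])
late_rows = count_codes(codes, categories, filter=[i > 2 for i in range(6)])
per_row = count_codes(codes, categories, transform=True)
```

Categories that are filtered out entirely still appear in the result with a count of 0, matching the filtered examples above.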
- static display_convert_func(item, itemformat)[source]
Used in conjunction with display_query_properties for final display of a categorical in a dataset. Removes quotation marks from multikey categorical tuples so display is easier to read.
- display_query_properties()[source]
Takes over display query properties for fastarray. By default, all categoricals will use left alignment.
- expand_any(categories)[source]
- Parameters:
categories (list or np.ndarray same size as categories array) –
- Return type:
A re-expanded array of mapping categories passed in.
Examples
>>> c = rt.Categorical(['a','a','b','c','a']) >>> c.expand_any(['d','e','f']) FastArray(['d', 'd', 'e', 'f', 'd'], dtype='<U8')
- fill_backward(*args, limit=0, fill_val=None, inplace=False)[source]
Replace NaN and invalid array values by propagating the next encountered valid group value backward.
Optionally, you can modify the original array if it’s not locked.
- Parameters:
*args (array or list of arrays) – The array or arrays that contain NaN or invalid values you want to replace.
limit (int, default 0 (disabled)) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.
fill_val (scalar, default None) – The value to use where there is no valid group value to propagate backward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate backward.
inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.
- Returns:
The Categorical will be the same size and have the same dtypes as the original input.
- Return type:
See also
Categorical.fill_forward
Replace NaN and invalid array values with the last valid group value.
GroupBy.fill_backward
Replace NaN and invalid array values with the next valid group value.
riptable.fill_backward
Replace NaN and invalid values with the next valid value.
Dataset.fillna
Replace NaN and invalid values with a specified value or nearby data.
FastArray.fillna
Replace NaN and invalid values with a specified value or nearby data.
Examples
>>> cat = rt.Categorical(['A', 'B', 'A', 'B', 'A', 'B']) >>> x = rt.FA([rt.nan, rt.nan, 2, 3, 4, 5]) >>> cat.fill_backward(x) *gb_key_0 col_0 --------- ----- A 2.00 B 3.00 A 2.00 B 3.00 A 4.00 B 5.00
Use a fill_val to replace values where there’s no valid group value to propagate backward:
>>> x = rt.FastArray([0, 1, 2, 3, rt.nan, rt.nan]) >>> cat.fill_backward(x, fill_val = 0)[0] FastArray([0., 1., 2., 3., 0., 0.])
Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values in a group:
>>> x = rt.FastArray([rt.nan, rt.nan, rt.nan, rt.nan, 4, 5]) >>> cat.fill_backward(x, limit = 1)[0] FastArray([nan, nan, 4., 5., 4., 5.])
- fill_forward(*args, limit=0, fill_val=None, inplace=False)[source]
Replace NaN and invalid array values by propagating the last encountered valid group value forward.
Optionally, you can modify the original array if it’s not locked.
- Parameters:
*args (array or list of arrays) – The array or arrays that contain NaN or invalid values you want to replace.
limit (int, default 0 (disabled)) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.
fill_val (scalar, default None) – The value to use where there is no valid group value to propagate forward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate forward.
inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.
- Returns:
The Categorical will be the same size and have the same dtypes as the original input.
- Return type:
See also
Categorical.fill_backward
Replace NaN and invalid array values with the next valid group value.
GroupBy.fill_forward
Replace NaN and invalid array values with the last valid group value.
riptable.fill_forward
Replace NaN and invalid values with the last valid value.
Dataset.fillna
Replace NaN and invalid values with a specified value or nearby data.
FastArray.fillna
Replace NaN and invalid values with a specified value or nearby data.
Examples
>>> cat = rt.Categorical(['A', 'B', 'A', 'B', 'A', 'B']) >>> x = rt.FastArray([0, 1, 2, 3, rt.nan, rt.nan]) >>> cat.fill_forward(x) *gb_key_0 col_0 --------- ----- A 0.00 B 1.00 A 2.00 B 3.00 A 2.00 B 3.00
Use a fill_val to replace values where there’s no valid group value to propagate forward:
>>> x = rt.FastArray([rt.nan, rt.nan, 2, 3, 4, 5]) >>> cat.fill_forward(x, fill_val = 0)[0] FastArray([0., 0., 2., 3., 4., 5.])
Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values in a group:
>>> x = rt.FastArray([0, 1, rt.nan, rt.nan, rt.nan, rt.nan]) >>> cat.fill_forward(x, limit = 1)[0] FastArray([ 0., 1., 0., 1., nan, nan])
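The per-group propagation, limit, and fill_val behavior in these examples can be mimicked with a short pure-Python function. This is an illustrative sketch, not riptable's algorithm: fill_forward_sketch is a hypothetical name, only float NaN is treated as missing, and it is an assumption that fill_val applies regardless of limit.

```python
import math


def fill_forward_sketch(keys, values, limit=0, fill_val=None):
    """Forward-fill NaNs within each group; limit=0 means no limit."""
    last_valid = {}   # group key -> last valid value seen
    run_length = {}   # group key -> consecutive NaNs seen since then
    out = []
    for key, val in zip(keys, values):
        if not math.isnan(val):
            last_valid[key] = val
            run_length[key] = 0
            out.append(val)
            continue
        run_length[key] = run_length.get(key, 0) + 1
        within_limit = limit == 0 or run_length[key] <= limit
        if key in last_valid and within_limit:
            out.append(last_valid[key])     # propagate the group's last value
        elif key not in last_valid and fill_val is not None:
            out.append(fill_val)            # nothing to propagate yet
        else:
            out.append(val)                 # leave the NaN in place
    return out


nan = float("nan")
keys = ["A", "B", "A", "B", "A", "B"]
basic = fill_forward_sketch(keys, [0, 1, 2, 3, nan, nan])
with_fill_val = fill_forward_sketch(keys, [nan, nan, 2, 3, 4, 5], fill_val=0)
limited = fill_forward_sketch(keys, [0, 1, nan, nan, nan, nan], limit=1)
```

A backward fill follows the same pattern when the inputs are walked in reverse order and the result is reversed again.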
- fill_invalid(shape=None, dtype=None, order=None, inplace=True)[source]
Returns a Categorical full of invalids, with reference to same categories. Must be base index 1.
- filtered_set_name(name)[source]
Set the name or value that will be displayed for filtered categories. Default is FILTERED_LONG_NAME
- from_bin(bin)[source]
Returns the category corresponding to a single integer. Raises error if index is out of range (accounts for base index) - or does not exist in mapping.
Notes
String values will appear as the scalar type they are stored in, however FastArray, Categorical, and other riptable routines will convert/compensate for unicode/bytestring mismatches.
Examples
Base-1 Indexing:
>>> c = rt.Categorical(['a','a','b','c','a']) >>> c.category_array FastArray([b'a', b'b', b'c'], dtype='|S1') >>> c.from_bin(2) b'b'
>>> c.from_bin(4) IndexError
Base-0 Indexing:
>>> c = rt.Categorical(['a','a','b','c','a'], base_index=0) >>> c.from_bin(2) b'c'
- from_category(category)[source]
Returns the bin associated with a category. If the category doesn’t exist, an error will be raised.
Note: the bin returned is the value as it appears in the underlying integer FastArray. It may not be a direct index into the stored unique categories.
Unicode/bytes conversion will be handled internally.
Examples
Single Key (base-1):
>>> c = rt.Categorical(['a','a','b','c','a']) >>> c.from_category('a') 1 >>> c = rt.Categorical(['a','a','b','c','a']) >>> c.from_category(b'c') 3
Single Key (base-0):
>>> c = rt.Categorical(['a','a','b','c','a'], base_index=0) >>> c.from_category('a') 0
Multikey:
>>> c = rt.Categorical([rt.FA(['a','b','c']), rt.arange(3)]) >>> c.from_category(('a', 0)) 1
Mapping:
>>> c = rt.Categorical([1,2,3], {'a':1, 'b':2, 'c':3}) >>> c.from_category('c') 3
Numeric:
>>> c = rt.Categorical(rt.FA([3.33, 5.55, 6.66])) >>> c.from_category(3.33) 1
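For single-key categoricals, the relationship between bins and categories in from_bin and from_category comes down to an offset by the base index. The following sketch assumes a plain list of unique categories; the two functions are hypothetical stand-ins and do not cover multikey or mapping modes.

```python
def from_bin_sketch(categories, bin, base_index=1):
    """Translate a stored bin number into its category."""
    pos = bin - base_index
    if not 0 <= pos < len(categories):
        # With base index 1 this also rejects bin 0, the Filtered bin.
        raise IndexError(f"bin {bin} is out of range")
    return categories[pos]


def from_category_sketch(categories, category, base_index=1):
    """Translate a category into the bin number stored for it."""
    if category not in categories:
        raise ValueError(f"category {category!r} not found")
    return categories.index(category) + base_index


cats = ["a", "b", "c"]
base1_bin = from_bin_sketch(cats, 2)                    # 'b'
base0_bin = from_bin_sketch(cats, 2, base_index=0)      # 'c'
base1_code = from_category_sketch(cats, "c")            # 3
base0_code = from_category_sketch(cats, "a", base_index=0)  # 0
```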
- static full(size, value)[source]
Create a Categorical of a given length, filled with a single value.
- Parameters:
size (int) – The size/length of the Categorical to create.
value – The value to be repeated.
- Return type:
Examples
Create a 1D Categorical array of length 100_000, filled with the string “example”.
>>> rt.Categorical.full(100_000, 'example') Categorical([example, example, example, example, example, ..., example, example, example, example, example]) Length: 100000 FastArray([1, 1, 1, 1, 1, ..., 1, 1, 1, 1, 1], dtype=int8) Base Index: 1 FastArray([b'example'], dtype='|S7') Unique count: 1
- groupby_data_set(ds)[source]
Store data to apply future groupby operations to. This will make the categorical behave like a groupby object that was created from a dataset. If data is specified during an operation, it will be used instead of the stored dataset.
- Parameters:
ds (Dataset) –
Examples
>>> c = rt.Categorical(['a','b','c','c','a','a']) >>> a = np.arange(6) >>> ds = rt.Dataset({'col':a}) >>> c.groupby_data_set(ds) >>> c.sum() *gb_key col ------- --- a 9 b 1 c 5
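The "stored data, overridable per call" pattern described above might be sketched like this. GroupedKeys and its methods are hypothetical names for illustration; data is modeled as a dict of column name to list rather than a Dataset.

```python
class GroupedKeys:
    """Toy grouping object that can remember a dataset (hypothetical)."""

    def __init__(self, keys):
        self.keys = list(keys)
        self._stored = None

    def groupby_data_set(self, data):
        # Remember the columns for later operations.
        self._stored = data

    def sum(self, data=None):
        # Data passed to the call takes priority over the stored data.
        data = self._stored if data is None else data
        if data is None:
            raise ValueError("no data given and none stored")
        totals = {}
        for name, column in data.items():
            col_totals = {}
            for key, val in zip(self.keys, column):
                col_totals[key] = col_totals.get(key, 0) + val
            totals[name] = col_totals
        return totals


g = GroupedKeys(["a", "b", "c", "c", "a", "a"])
g.groupby_data_set({"col": list(range(6))})
stored_result = g.sum()
explicit_result = g.sum({"other": [1] * 6})
```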
- groupby_reset()[source]
Resets all lazily evaluated groupby information. The categorical will go back to the state it was in just after construction. This is called any time the categories are modified.
- classmethod hstack(cats)[source]
cats must be a list of categoricals. Their unique categories are merged into a new unique list, and the indices are remapped to point into the new category array. The remapped indices are then hstacked and a new categorical is returned.
Examples
>>> c1 = rt.Categorical(['a','b','c']) >>> c2 = rt.Categorical(['d','e','f']) >>> combined = rt.Categorical.hstack([c1,c2]) >>> combined Categorical([a, b, c, d, e, f]) Length: 6 FastArray([1, 2, 3, 4, 5, 6]) Base Index: 1 FastArray([b'a', b'b', b'c', b'd', b'e', b'f'], dtype='|S1') Unique count: 6
- info()[source]
The three arrays in info:
Categories mapped to their indices, often making the categorical appear to be a string array. Length of array.
Underlying array of integer indices, dtype. Base index (normally 1 to reserve 0 as an invalid bin for groupby - much better for performance).
Categories - list or dictionary.
The CategoryMode is also displayed:
Mode:
Default - no example
StringArray - categories are held in a single string array
IntEnum - categories are held in a dictionary generated from an IntEnum
Dictionary - categories are held in a dictionary generated from a code-mapping dictionary
NumericArray - categories are held in a single numeric array
MultiKey - categories are held in a dictionary (when constructed with multikey, or numeric categories the groupby hash does the binning)
Locked:
If True, categories may not be changed.
- invalid_set(inv)[source]
Set a Categorical category to be invalid.
An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.
If there’s an existing invalid category in the Categorical, using Categorical.invalid_set to set a different category causes the existing invalid category to become valid.
See also
Categorical.isnan
Find the invalid elements of a Categorical.
Categorical.isnotnan
Find the valid elements of a Categorical.
Categorical.invalid_category
The Categorical object’s invalid category.
Examples
>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"]) >>> c Categorical([b, a, c, b, c]) Length: 5 FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.invalid_set("b") >>> c.invalid_category 'b' >>> c.isnan() # Returns True for invalid category. FastArray([ True, False, False, True, False])
Set a new invalid category:
>>> c.invalid_set("a") >>> c.invalid_category 'a' >>> c.isnan() FastArray([False, True, False, False, False])
- isfiltered()[source]
True where bin == 0. Only applies to categoricals with base index 1, otherwise returns all False. Different than invalid category.
See also
- isin(values)[source]
- Parameters:
values (a list-like or single value to be searched for) –
- Returns:
Boolean array with the same size as self. True indicates that the array element occurred in the provided values.
- Return type:
Notes
Behavior differs from pandas in the following ways:
* Riptable favors bytestrings, and will make conversions from unicode/bytes to match for operations as necessary.
* We also accept single scalars for values.
* Pandas series will return another series - we have no series, and will return a FastArray.
Examples
>>> c = rt.Categorical(['a','b','c','d','e'], unicode=False) >>> c.isin(['a','b']) FastArray([ True, True, False, False, False])
See also
pandas.Categorical.isin
- isna(*args, **kwargs)[source]
See Categorical.isnan.
- isnan(*args, **kwargs)[source]
Find the invalid elements of a Categorical.
An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.
- Returns:
A boolean array the length of the values array where True indicates an invalid Categorical category.
- Return type:
See also
Categorical.isnotnan
Find the valid elements of a Categorical.
Categorical.invalid_category
The Categorical object’s invalid category.
Categorical.invalid_set
Set a Categorical category to be invalid.
Examples
>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b") >>> c Categorical([b, a, c, b, c]) Length: 5 FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.isnan() FastArray([ True, False, False, True, False])
Invalid categories are different from Filtered categories:
>>> f = rt.FA([True, False, True, True, True]) >>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b", filter=f) >>> c2 Categorical([b, Filtered, c, b, c]) Length: 5 FastArray([1, 0, 2, 1, 2], dtype=int8) Base Index: 1 FastArray([b'b', b'c'], dtype='|S1') Unique count: 2 >>> c2.isnan() # Only the invalid category returns True for Cat.isnan. FastArray([ True, False, False, True, False]) >>> c2.isfiltered() # Only the Filtered value returns True for Cat.isfiltered. FastArray([False, True, False, False, False])
Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is False for Cat.isnan():
>>> a = rt.FA([1, 2, 3, 4]) >>> a[3] = a.inv # Set the last value to an integer NaN. >>> a FastArray([ 1, 2, 3, -2147483648]) >>> c3 = rt.Categorical(values=a, invalid=2) # Make 2 an invalid category. >>> c3 Categorical([1, 2, 3, -2147483648]) Length: 4 FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1 FastArray([-2147483648, 1, 2, 3]) Unique count: 4 >>> c3.invalid_category 2 >>> c3.isnan() # Only the invalid category returns True for Cat.isnan. FastArray([False, True, False, False]) >>> c3.expand_array.isnan() # Only the integer NaN returns True for FA.isnan. FastArray([False, False, False, True])
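The distinction between an invalid category and a Filtered value can be made concrete with a small sketch over 1-based codes. The two helpers below are hypothetical and assume that code 0 means Filtered and that the invalid category is one of the ordinary unique categories.

```python
def isnan_sketch(codes, categories, invalid_category):
    """True where the element's category is the designated invalid one."""
    return [c > 0 and categories[c - 1] == invalid_category for c in codes]


def isfiltered_sketch(codes):
    """True where the element was filtered out (bin 0)."""
    return [c == 0 for c in codes]


# Mirrors the c2 example: values b, Filtered, c, b, c with invalid="b".
codes = [1, 0, 2, 1, 2]
categories = ["b", "c"]
invalid_mask = isnan_sketch(codes, categories, "b")
filtered_mask = isfiltered_sketch(codes)
valid_mask = [not flag for flag in invalid_mask]
```

The two masks never overlap: a Filtered element has no category at all, so it cannot match the invalid category.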
- isnotnan(*args, **kwargs)[source]
Find the valid elements of a Categorical.
An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.
- Returns:
A boolean array the length of the values array where True indicates a valid Categorical category.
- Return type:
See also
Categorical.isnan
Find the invalid elements of a Categorical.
Categorical.invalid_category
The Categorical object’s invalid category.
Categorical.invalid_set
Set a Categorical category to be invalid.
Examples
>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b") >>> c Categorical([b, a, c, b, c]) Length: 5 FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3 >>> c.isnotnan() FastArray([False, True, True, False, True])
Invalid categories are different from Filtered categories:
>>> f = rt.FA([True, False, True, True, True]) >>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b", filter=f) >>> c2 Categorical([b, Filtered, c, b, c]) Length: 5 FastArray([1, 0, 2, 1, 2], dtype=int8) Base Index: 1 FastArray([b'b', b'c'], dtype='|S1') Unique count: 2 >>> c2.isnotnan() # Only the invalid category returns False for Cat.isnotnan. FastArray([False, True, True, False, True]) >>> ~c2.isfiltered() # Only the Filtered value returns False for the negation of Cat.isfiltered. FastArray([ True, False, True, True, True])
Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is True for Cat.isnotnan():
>>> a = rt.FA([1, 2, 3, 4]) >>> a[3] = a.inv # Set the last value to an integer NaN. >>> a FastArray([ 1, 2, 3, -2147483648]) >>> c3 = rt.Categorical(values=a, invalid=2) # Make 2 an invalid category. >>> c3 Categorical([1, 2, 3, -2147483648]) Length: 4 FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1 FastArray([-2147483648, 1, 2, 3]) Unique count: 4 >>> c3.invalid_category 2 >>> c3.isnotnan() # Only the invalid category returns False for Cat.isnotnan. FastArray([ True, False, True, True]) >>> c3.expand_array.isnotnan() # Only the integer NaN returns False for FA.isnotnan. FastArray([ True, True, True, False])
- map(mapper, invalid=None)[source]
Maps existing categories to new categories and returns a re-expanded array.
- Parameters:
mapper (dictionary or numpy.array or FastArray) –
dictionary maps existing categories -> new categories
array must be the same size as the existing category array
invalid – Optionally specify an invalid value to insert for existing categories that were not found in the new mapping. If no invalid is set, the default invalid for the result’s dtype will be used.
- Returns:
Re-expanded array.
- Return type:
Notes
Maybe to add: - option to return categorical instead of re-expanding - dtype for return array
Examples
New strings (all exist, no invalids in original):
>>> c = rt.Categorical(['b','b','c','a','d'], ordered=False) >>> mapping = {'a': 'AA', 'b': 'BB', 'c': 'CC', 'd': 'DD'} >>> c.map(mapping) FastArray([b'BB', b'BB', b'CC', b'AA', b'DD'], dtype='|S3')
New strings (not all exist, no invalids in original):
>>> mapping = {'a': 'AA', 'b': 'BB', 'c': 'CC'} >>> c.map(mapping, invalid='INVALID') FastArray([b'BB', b'BB', b'CC', b'AA', b'INVALID'], dtype='|S7')
String to float:
>>> mapping = {'a': 1., 'b': 2., 'c': 3.} >>> c.map(mapping, invalid=666) FastArray([ 2., 2., 3., 1., 666.])
If no invalid is specified, the default invalid will be used:
>>> c.map(mapping) FastArray([ 2., 2., 3., 1., nan])
Mapping as array (must be the same size):
>>> mapping = rt.FastArray(['w','x','y','z']) >>> c.map(mapping) FastArray([b'w', b'w', b'x', b'y', b'z'], dtype='|S3')
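The two mapper forms can be sketched over a list of 1-based codes. This is a simplified illustration: map_sketch is a hypothetical helper, it returns a plain list, and it uses None where riptable would use the default invalid for the result's dtype.

```python
def map_sketch(codes, categories, mapper, invalid=None):
    """Map existing categories to new ones and re-expand to full length."""
    if isinstance(mapper, dict):
        # Categories missing from the dict fall back to the invalid value.
        new_categories = [mapper.get(cat, invalid) for cat in categories]
    else:
        if len(mapper) != len(categories):
            raise ValueError("mapper must be the same size as the categories")
        new_categories = list(mapper)
    # Re-expand: one mapped value per original element; bin 0 becomes invalid.
    return [new_categories[c - 1] if c > 0 else invalid for c in codes]


# ['b','b','c','a','d'] with unordered categories, as in the examples above.
codes = [1, 1, 2, 3, 4]
categories = ["b", "c", "a", "d"]
full = map_sketch(codes, categories, {"a": "AA", "b": "BB", "c": "CC", "d": "DD"})
partial = map_sketch(codes, categories, {"a": "AA", "b": "BB", "c": "CC"},
                     invalid="INVALID")
by_array = map_sketch(codes, categories, ["w", "x", "y", "z"])
```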
- mapping_new(mapping)[source]
Replace entire mapping dictionary. No codes in the Categorical’s integer FastArray will be changed. If they are not in the new mapping, they will appear as Invalid.
- classmethod newclassfrominstance(instance, origin)[source]
Used when the FastArray portion of the Categorical is updated, but not the rest of the class attributes.
Examples
>>> c=rt.Cat(['a','b','c']) >>> rt.Cat.newclassfrominstance(c._fa[1:2],c) Categorical([b]) Length: 1 FastArray([2], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
- notna(*args, **kwargs)[source]
See Categorical.isnotnan.
- nth(arr, n=1, transform=None, filter=None, showfilter=None)[source]
Select the nth row from each group.
- Parameters:
arr (array or list of array) – The array of values to select from.
n (int) – A single nth value for the row.
transform (bool) – If True, the output will have the same shape as arr. If False, the output will typically have the same shape as the Categorical.
filter (array of bool, optional) – Elements to include in the operation.
showfilter (bool) – If True, the output contains an extra row representing the operation applied to a stack of all the elements that were filtered out (both at Categorical creation and in this operation, using a filter).
Examples
>>> ds = rt.Dataset({'A': rt.Categorical(['a', 'a', 'b', 'a', 'b']),
...                  'B': [rt.nan, 2, 3, 4, 5]})
>>> c = ds.A
>>> c.nth([ds.A, ds.B], 0)
*A      B
--   ----
a     nan
b    3.00
[2 rows x 2 columns] total bytes: 18.0 B

>>> c.nth([ds.A, ds.B], 1)
*A      B
--   ----
a    2.00
b    5.00
[2 rows x 2 columns] total bytes: 18.0 B

>>> c.nth([ds.A, ds.B], -1)
*A      B
--   ----
a    4.00
b    5.00
[2 rows x 2 columns] total bytes: 18.0 B

>>> c.nth(ds.B, -2, transform=True)
#      B
-   ----
0   2.00
1   2.00
2   3.00
3   2.00
4   3.00
[5 rows x 1 columns] total bytes: 40.0 B

>>> c.nth(ds.B, 1, filter=ds.B.isnotnan())
*A      B
--   ----
a    4.00
b    5.00
[2 rows x 2 columns] total bytes: 18.0 B

>>> c.nth(ds.B, -2, filter=ds.A!='b', showfilter=True)
*A            B
--------   ----
Filtered   3.00
a          2.00
b           nan
[3 rows x 2 columns] total bytes: 48.0 B
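The per-group selection can be sketched in plain Python. This is only an illustration of the idea, not riptable's code: `nth_per_group` is a hypothetical helper, and the NaN from the examples is replaced by 1.0 so the results are easy to compare.

```python
from collections import defaultdict


def nth_per_group(keys, values, n):
    """Return {key: nth value of that group}; None when the group has too few rows."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    result = {}
    for key, rows in groups.items():
        # Negative n counts from the end of the group, as with Python indexing.
        result[key] = rows[n] if -len(rows) <= n < len(rows) else None
    return result


keys = ['a', 'a', 'b', 'a', 'b']
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(nth_per_group(keys, values, 1))    # {'a': 2.0, 'b': 5.0}
print(nth_per_group(keys, values, -1))   # {'a': 4.0, 'b': 5.0}
print(nth_per_group(keys, values, 2))    # {'a': 4.0, 'b': None}
```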
- numba_apply(userfunc, *args, filter=None, transform=False, **kwargs)[source]
Applies a user-supplied Numba function over the groups of a Categorical. The function should return either a scalar or an np.array the same size as the input array. If the function returns a scalar, set transform=True to re-expand the result to the size of the Categorical.
- Parameters:
userfunc (numba function) – The function to apply to each group.
args (np.array) – Input array(s). userfunc must return a scalar or an np.array of the same length.
filter (array of bool) – Boolean filter.
kwargs – Keyword arguments to pass to userfunc.
transform (bool) – Set to True if userfunc returns a scalar but you want the result re-expanded to the size of the original array.
- Return type:
Dataset with categorical keys for scalar function with transform = False, otherwise aligned to original categorical
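The difference between the two return shapes can be shown with a plain Python stand-in. No Numba is involved here, `group_apply` is a hypothetical helper, and only the scalar-returning case is covered.

```python
from collections import defaultdict


def group_apply(keys, values, func, transform=False):
    """Apply a scalar-returning func to each group; optionally re-expand to the input length."""
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)
    per_group = {key: func([values[r] for r in rows]) for key, rows in rows_by_key.items()}
    if not transform:
        return per_group                      # one result per category
    out = [None] * len(values)
    for key, rows in rows_by_key.items():
        for row in rows:
            out[row] = per_group[key]         # broadcast the group result back to its rows
    return out


keys = ['a', 'b', 'a', 'b', 'a']
values = [1, 2, 3, 4, 5]
print(group_apply(keys, values, sum))                  # {'a': 9, 'b': 6}
print(group_apply(keys, values, sum, transform=True))  # [9, 6, 9, 6, 9]
```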
- nunique()[source]
Number of unique values that occur in the Categorical. Does not include invalids. Not the same as the length of possible uniques.
Categoricals based on dictionary mapping / enum will return unique count including all possibly invalid values from underlying array.
See also
- one_hot_encode(dtype=None, categories=None, return_labels=True)[source]
Generate one hot encoded arrays from each unique category.
- Parameters:
dtype (data-type, optional) – The numpy data type to use for the one-hot encoded arrays. If dtype is not specified (i.e. is None), the encoded arrays will default to using a np.float32 representation.
categories (list or array-like, optional) – List or array containing unique category values to one-hot encode. Specify this when you only want to encode a subset of the unique category values. Defaults to None, in which case all categories are encoded.
return_labels (bool) – Not implemented.
- Returns:
col_names (FastArray) – FastArray of column names (unique categories as unicode strings)
encoded_arrays (list of FastArray) – list of one-hot encoded arrays for each category
Notes
Unicode is used because the column names are often going to a dataset.
Performance warning for large amount of uniques - an array will be generated for ALL of them
Examples
Default:
>>> c = rt.Categorical(rt.FA(['a','a','b','c','a'])) >>> c.one_hot_encode() (FastArray(['a', 'b', 'c'], dtype='<U1'), [FastArray([1., 1., 0., 0., 1.], dtype=float32), FastArray([0., 0., 1., 0., 0.], dtype=float32), FastArray([0., 0., 0., 1., 0.], dtype=float32)])
Custom dtype:
>>> c.one_hot_encode(dtype=np.int8) (FastArray(['a', 'b', 'c'], dtype='<U1'), [FastArray([1, 1, 0, 0, 1], dtype=int8), FastArray([0, 0, 1, 0, 0], dtype=int8), FastArray([0, 0, 0, 1, 0], dtype=int8)])
Specific categories:
>>> c.one_hot_encode(categories=['a','b']) (FastArray(['a', 'b'], dtype='<U1'), [FastArray([ True, True, False, False, True]), FastArray([False, False, True, False, False])])
Multikey:
>>> #NOTE: The double-quotes in the category names are not part of the actual string. >>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.FA([1, 1, 2, 3, 1]) ] ) >>> c.one_hot_encode() (FastArray(["('a', '1')", "('b', '2')", "('c', '3')"], dtype='<U10'), [FastArray([1., 1., 0., 0., 1.], dtype=float32), FastArray([0., 0., 1., 0., 0.], dtype=float32), FastArray([0., 0., 0., 1., 0.], dtype=float32)])
Mapping:
>>> c = rt.Categorical(rt.arange(3), {'a':0, 'b':1, 'c':2}) >>> c.one_hot_encode() (FastArray(['a', 'b', 'c'], dtype='<U1'), [FastArray([1., 0., 0.], dtype=float32), FastArray([0., 1., 0.], dtype=float32), FastArray([0., 0., 1.], dtype=float32)])
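For readers unfamiliar with the encoding itself, here is a minimal pure-Python sketch of one-hot encoding. The helper `one_hot` is hypothetical and uses Python floats rather than np.float32 arrays.

```python
def one_hot(values, categories=None):
    """Return (labels, arrays): one 0/1 list per category, each as long as `values`."""
    if categories is None:
        categories = sorted(set(values))      # encode every unique value
    encoded = [[1.0 if value == cat else 0.0 for value in values] for cat in categories]
    return list(categories), encoded


labels, arrays = one_hot(['a', 'a', 'b', 'c', 'a'])
print(labels)       # ['a', 'b', 'c']
print(arrays[0])    # [1.0, 1.0, 0.0, 0.0, 1.0]
```

One list is built per category, which is why the performance warning above applies when the number of uniques is large.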
- set_name(name)[source]
If the grouping dict contains a single item, rename it.
See also
Grouping.set_name
,FastArray.set_name
- set_valid(filter=None)[source]
Apply a filter to the categorical’s values. If values no longer occur in the uniques, the uniques will be reduced, and the index will be recalculated.
- Parameters:
filter (boolean array, optional) – If provided, must be the same size as the categorical’s underlying array. Will be used to mask non-unique values. If not provided, categorical may still reduce its unique values to the unique occurring values.
- Returns:
c – New categorical with possibly reduced uniques.
- Return type:
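The reduction of the uniques and the recalculated index can be sketched as follows. This is an illustration under the assumption of 1-based codes with 0 meaning Filtered; `set_valid_sketch` is a hypothetical helper, not the library routine.

```python
def set_valid_sketch(codes, categories, keep=None):
    """Mask rows with `keep`, then drop categories that no longer occur and renumber the codes."""
    if keep is None:
        keep = [True] * len(codes)
    masked = [code if flag else 0 for code, flag in zip(codes, keep)]
    used = sorted({code for code in masked if code > 0})
    remap = {old: new for new, old in enumerate(used, start=1)}
    new_codes = [remap.get(code, 0) for code in masked]
    new_categories = [categories[old - 1] for old in used]
    return new_codes, new_categories


codes = [2, 1, 2, 3]                 # 'b', 'a', 'b', 'c'
categories = ['a', 'b', 'c']
print(set_valid_sketch(codes, categories, keep=[True, False, True, True]))
# ([1, 0, 1, 2], ['b', 'c'])
```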
- shift(arr, window=None, *, periods=None, filter=None)[source]
Shift values in each group by the specified number of periods.
Where the shift introduces a missing value, the missing value is filled with the invalid value for the array’s data type (for example, NaN for floating-point arrays or the sentinel value for integer arrays).
- Parameters:
arr (array or list of array) – The array of values to shift.
window (int, default 1) – The number of periods to shift. Can be a negative number to shift values backward.
periods (int, optional, default 1) – Can use periods instead of window for Pandas parameter support.
filter (FastArray of bool, optional) – Set of rows to include. Filtered out rows are skipped by the shift and become NaN in the output.
- Returns:
A Dataset containing a column of shifted values.
- Return type:
See also
Categorical.shift_cat
Shift the values of a Categorical.
FastArray.shift
Shift the values of a FastArray.
DateTimeNano.shift
Shift the values of a DateTimeNano array.
Examples
With the default window=1:
>>> c = rt.Cat(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])
>>> fa = rt.arange(9)
>>> shift_val = c.shift(fa)
>>> shift_val
#   col_0
-   -----
0     Inv
1       0
2       1
3     Inv
4       3
5       4
6     Inv
7       6
8       7

With window=2:
>>> shift_val_2 = c.shift(fa, window=2)
>>> shift_val_2
#   col_0
-   -----
0     Inv
1     Inv
2       0
3     Inv
4     Inv
5       3
6     Inv
7     Inv
8       6

With window=-1:
>>> shift_neg = c.shift(fa, window=-1)
>>> shift_neg
#   col_0
-   -----
0       1
1       2
2     Inv
3       4
4       5
5     Inv
6       7
7       8
8     Inv

With filter:
>>> filt = rt.FA([True, True, True, True, False, True, False, True, True])
>>> shift_filt = c.shift(fa, filter=filt)
>>> shift_filt
#   col_0
-   -----
0     Inv
1       0
2       1
3     Inv
4     Inv
5       3
6     Inv
7     Inv
8       7

Results put in a Dataset to show the shifts in relation to the categories:
>>> ds = rt.Dataset()
>>> ds.c = c
>>> ds.shift_val = shift_val
>>> ds.shift_val_2 = shift_val_2
>>> ds.shift_neg = shift_neg
>>> ds
#   c   shift_val   shift_val_2   shift_neg
-   -   ---------   -----------   ---------
0   a         Inv           Inv           1
1   a           0           Inv           2
2   a           1             0         Inv
3   b         Inv           Inv           4
4   b           3           Inv           5
5   b           4             3         Inv
6   c         Inv           Inv           7
7   c           6           Inv           8
8   c           7             6         Inv

Shift two arrays:
>>> fa2 = rt.arange(10, 19)
>>> shift_val_3 = c.shift([fa, fa2])
>>> shift_val_3
#   col_0   col_1
-   -----   -----
0     Inv     Inv
1       0      10
2       1      11
3     Inv     Inv
4       3      13
5       4      14
6     Inv     Inv
7       6      16
8       7      17
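The group-wise shift shown in these examples can be approximated in plain Python. This sketch is illustrative only: `group_shift` is a hypothetical helper, `None` stands in for the invalid value, and the `filter` argument is not modelled.

```python
def group_shift(keys, values, window=1, invalid=None):
    """Shift values within each group; positions with no source row get `invalid`."""
    rows_by_key = {}
    for row, key in enumerate(keys):
        rows_by_key.setdefault(key, []).append(row)
    out = [invalid] * len(values)
    for rows in rows_by_key.values():
        for pos, row in enumerate(rows):
            src = pos - window               # position inside the group to copy from
            if 0 <= src < len(rows):
                out[row] = values[rows[src]]
    return out


keys = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
values = list(range(9))
print(group_shift(keys, values))             # [None, 0, 1, None, 3, 4, None, 6, 7]
print(group_shift(keys, values, window=-1))  # [1, 2, None, 4, 5, None, 7, 8, None]
```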
- shift_cat(periods=1)[source]
See FastArray.shift(). Unlike shift on a FastArray, which fills with NaN or sentinel values, the invalid category appears where the shift introduces missing values. Returns a new categorical.
Examples
>>> rt.Cat(['a','b','c']).shift_cat(1) Categorical([Filtered, a, b]) Length: 3 FastArray([0, 1, 2], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
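In terms of the underlying integer codes, the output above corresponds to shifting the code array and filling the gap with 0, the Filtered bin of a base-1 Categorical. A minimal sketch, with `shift_codes` as a hypothetical helper:

```python
def shift_codes(codes, periods=1):
    """Shift a list of 1-based codes, filling vacated slots with 0 (Filtered)."""
    n = len(codes)
    if periods >= 0:
        return [0] * min(periods, n) + codes[: max(n - periods, 0)]
    return codes[-periods:] + [0] * min(-periods, n)


print(shift_codes([1, 2, 3], 1))    # [0, 1, 2]
print(shift_codes([1, 2, 3], -1))   # [2, 3, 0]
```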
- shrink(newcats, misc=None, inplace=False)[source]
- Parameters:
newcats (array-like) – New categories to replace the old - typically a reduced set.
misc (scalar, optional (often a string)) – Value to use as category for items not found in new categories. This will be added to the new categories. If not provided, all items not found will be set to a filtered bin.
inplace (bool) – If True, re-index the categorical’s underlying FastArray. Otherwise, return a new categorical with a new index and grouping object.
- Returns:
A new Categorical with the new index.
- Return type:
Examples
Base index 1, no misc
>>> c = rt.Categorical([1,2,3,1,2,3,0], ['a','b','c']) >>> c.shrink(['b','c']) Categorical([Filtered, b, c, Filtered, b, c, Filtered]) Length: 7 FastArray([0, 1, 2, 0, 1, 2, 0]) Base Index: 1 FastArray([b'b', b'c'], dtype='|S1') Unique count: 2
Base index 1, filtered bins and misc
>>> c.shrink(['b','c'], 'AAA').sum(rt.arange(7), showfilter=True) *key_0 col_0 -------- ----- Filtered 6 AAA 3 b 5 c 7
Base index 0, with misc
>>> c = rt.Categorical([0,1,2,0,1,2], ['a','b','c'], base_index=0) >>> c.shrink(['b','c'], 'AAA') Categorical([AAA, b, c, AAA, b, c]) Length: 6 FastArray([0, 1, 2, 0, 1, 2], dtype=int8) Base Index: 0 FastArray(['AAA', 'b', 'c'], dtype='<U3') Unique count: 3
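The re-indexing behind the base-index-1 examples can be sketched in pure Python. This is an assumption-laden illustration: `shrink_sketch` is a hypothetical helper, it only handles 1-based codes, and it assumes the new categories (including `misc`) are kept in sorted order.

```python
def shrink_sketch(codes, categories, newcats, misc=None):
    """Re-index 1-based codes against a reduced category list; 0 means Filtered."""
    final_cats = sorted(set(newcats) | ({misc} if misc is not None else set()))
    position = {cat: i + 1 for i, cat in enumerate(final_cats)}
    # Items not found in the new categories go to `misc` if given, else to Filtered.
    fallback = position[misc] if misc is not None else 0
    new_codes = []
    for code in codes:
        if code == 0:
            new_codes.append(0)               # already Filtered stays Filtered
        else:
            new_codes.append(position.get(categories[code - 1], fallback))
    return new_codes, final_cats


codes = [1, 2, 3, 1, 2, 3, 0]
categories = ['a', 'b', 'c']
print(shrink_sketch(codes, categories, ['b', 'c']))
# ([0, 1, 2, 0, 1, 2, 0], ['b', 'c'])
print(shrink_sketch(codes, categories, ['b', 'c'], misc='AAA'))
# ([1, 2, 3, 1, 2, 3, 0], ['AAA', 'b', 'c'])
```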
See also
- str()
Casts an array of byte strings or unicode as FAString. Enables a variety of useful string manipulation methods.
- Return type:
- Raises:
TypeError – If the FastArray is of dtype other than byte string or unicode
See also
np.chararray
,np.char
,rt.FAString.apply
Examples
>>> s=rt.FA(['this','that','test ']*100_000) >>> s.str.upper FastArray([b'THIS', b'THAT', b'TEST ', ..., b'THIS', b'THAT', b'TEST '], dtype='|S5')
>>> s.str.lower FastArray([b'this', b'that', b'test ', ..., b'this', b'that', b'test '], dtype='|S5')
>>> s.str.removetrailing() FastArray([b'this', b'that', b'test', ..., b'this', b'that', b'test'], dtype='|S5')
- to_arrow(type=None, *, preserve_fixed_bytes=False, empty_strings_to_null=True)[source]
Convert this Categorical to a pyarrow.Array.
- Parameters:
type (pyarrow.DataType, optional, defaults to None) – Unused.
preserve_fixed_bytes (bool, optional, defaults to False) – Unused.
empty_strings_to_null (bool, optional, defaults to True) – Unused.
- Return type:
Notes
- TODO: Consider whether we should store all Categoricals as Struct-type pyarrow arrays, since that’d allow us to preserve the key names, even for single-key Categoricals.
- class riptable.rt_categorical.Categories(*args, base_index=1, invalid_category=None, ordered=False, unicode=False, _from_categorical=False, **kwargs)[source]
Holds categories for each Categorical instance. This adds a layer of abstraction to Categorical.
Categories objects are constructed in Categorical’s constructor and other internal routines such as merging operations. The Categories object is responsible for translating the values in the Categorical’s underlying fast array into the correct bin in the categories. It performs different operations to retrieve the correct bins based on its mode.
- Parameters:
categories – main categories data - can also be empty list
invalid_category (str) – string that will be displayed for an invalid index
invalid_index – sentinel value for a particular index; this invalid will be displayed differently in IntEnum/Dictionary modes
ordered (bool) – flag for list modes, ordered categories can use a binary search for finding bins
auto_add_categories – if a setitem (bracket-indexing with a value) is called, and the value is not in the categories, this flag allows it to be added automatically.
na_added – for some constructors, the calling Categorical has already added the invalid category
base_index – the calling Categorical passes in the index offset for list and grouping modes
multikey – the categories information is stored in a multikey dictionary up for deletion
groupby – possibly merge with the multikey flag
Notes
There are multiple modes in which a Categories object can operate.
StringArray: (list_modes) Two paths for initializations use the categories routines: TB Filled in LATER array and list of unique categories. String mode will be set to unicode or bytes so the correct encoding/decoding can be performed before comparison/searching operations.
- from list of strings (unique/ismember)
- from list of strings paired with unique string categories (unique/ismember)
- from codes paired with unique string categories (assignment will happen without unique/ismember)
- from pandas categoricals (with string categories) (assignment will happen without unique/ismember)
- from matlab categoricals (with string categories) (assignment will happen without unique/ismember)
NumericArray: (list_modes) This is not currently implemented as default behavior, but if enabled it will handle these constructors:
- from list of integers
- from list of floats
- from codes paired with unique integer categories
- from codes paired with unique float categories
- from list of floats paired with unique float categories
- from pandas categoricals with numeric categories
IntEnum / Dictionary: (dict_modes) Two dictionaries will be held: one mapping strings to integers, another mapping integers to strings. This mode requires that all strings and their corresponding codes are one-to-one.
- from codes paired with IntEnum object
- from codes paired with Integer -> String dictionary
- from codes paired with String -> Integer dictionary
not implemented
Grouping: All categories objects in Grouping mode hold categories in a dictionary, even if the dictionary only contains one item. Information for indexed items will appear in a tuple if multiple columns are being held.
- from list of key columns
- from dictionary of key columns
- from single list of numeric type
- from dataset
not implemented
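The one-to-one requirement of the IntEnum / Dictionary mode can be illustrated with a small sketch that builds the two dictionaries. `build_two_way_dicts` is a hypothetical helper, not the library's `build_dicts_python`, and raising on a duplicate code is an assumption made for this illustration.

```python
def build_two_way_dicts(str_to_int):
    """Build forward and backward dictionaries, rejecting mappings that are not one-to-one."""
    int_to_str = {}
    for name, code in str_to_int.items():
        if code in int_to_str:
            raise ValueError(f"code {code} maps to both {int_to_str[code]!r} and {name!r}")
        int_to_str[code] = name
    return dict(str_to_int), int_to_str


forward, backward = build_two_way_dicts({'a': 0, 'b': 1, 'c': 2})
print(forward['b'])    # 1
print(backward[2])     # c
```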
- property _first_list
Returns the first column when categories are in a dictionary, or the list if the categories are in a list mode.
- property base_index
- property grouping
- property int2strdict
- property isbytes
True if uniques are held in single array of bytes. Otherwise False.
- property isenum
True if uniques have an enum / dictionary mapping for uniques. Otherwise False.
See also: GroupingEnum
- property ismultikey
True if unique dict holds multiple arrays. False if unique dict holds single array or in enum mode.
- property issinglekey
True if unique dict holds single array. False if unique dict holds multiple arrays or in enum mode.
- property isunicode
True if uniques are held in single array of unicode. Otherwise False.
- property mode
- property ncols: int
Returns the number of key columns in a multikey categorical or 1 if a single key’s categories are being held in a dictionary.
- property str2intdict
- property uniquedict
- property uniquelist
- _grouping: riptable.rt_grouping.Grouping
- default_colname = 'key_0'
- dict_modes
- list_modes
- multikey_spacer = ' '
- numeric_modes
- string_modes
- __len__()[source]
TODO: consider changing length of enum/dict mode categories to be the length of the dictionary. using max int so the calling Categorical can properly recast the integer array.
- _copy(deep=True)[source]
Creates a new categories object and possibly performs a deep copy of category list. Currently only supports Categories in list modes.
- _getitem_enum(value)[source]
At this point, the categorical’s underlying fast array’s __getitem__ has already been hit. It will only execute if the return value was scalar. No need to handle lists/arrays/etc. - which take a different path in Categorical.__getitem__
The value should always be a single integer.
This will return a single item or list of items from an int/string index. Enums will always return an array of values, even if there is only one entry. Enum dictionaries can only be looked up with unicode strings, so bytes will be converted.
- _possibly_add_categories(new_categories)[source]
Add non-existing categories to categories. If categories were added, an array is returned to fix the old indexes. If no categories were added, returns None.
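One way to picture the returned fix-up array is sketched below. It assumes sorted categories and 1-based codes, neither of which is stated for this routine; `add_categories_sketch` is a hypothetical helper.

```python
def add_categories_sketch(categories, new_categories):
    """Return (merged_categories, fix) where fix[old_code] gives the new code, or fix is None."""
    missing = set(new_categories) - set(categories)
    if not missing:
        return list(categories), None         # nothing added, old indexes stay valid
    merged = sorted(set(categories) | missing)
    position = {cat: i + 1 for i, cat in enumerate(merged)}
    # Slot 0 keeps the Filtered code unchanged.
    fix = [0] + [position[cat] for cat in categories]
    return merged, fix


merged, fix = add_categories_sketch(['a', 'c'], ['b'])
old_codes = [1, 2, 2, 0]
print(merged)                          # ['a', 'b', 'c']
print([fix[c] for c in old_codes])     # [1, 3, 3, 0]
```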
- classmethod build_dicts_enum(enum)[source]
Builds forward/backward dictionaries from IntEnums. Warns if multiple identifiers have the same value.
- classmethod build_dicts_python(python_dict)[source]
Categoricals can be initialized with a dictionary of string to integer or integer to string. Python dictionaries accept multiple types for their keys, so the dictionaries need to check types as they’re being constructed.
- get_categories()[source]
TODO: decide what to return for int enum categories. for now returning list of category strings
- get_category_index(s)[source]
Returns an integer or float for logical comparisons with the Categorical’s index array. Floating point return ensures that LTE/GTE functions work properly
- get_category_match_index(fld)[source]
Returns the indices of matching strings in the unique list. The Categorical instance will compare these integers to those in its underlying array to generate a boolean mask.
- get_multikey_index(multikey)[source]
Multikey categoricals can be indexed by tuple. This is an internal routine for getitem, setitem, and logical comparisons. Valid return will be adjusted for the base index of the categorical (currently always 1 for multikey)
- Parameters:
multikey (tuple of items to search for in multiple columns) –
- Returns:
location of multikey + base index, or -1 if not found
- Return type:
Examples
>>> c = rt.Categorical([rt.arange(5), rt.arange(5)]) >>> c Categorical([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]) Length: 5 FastArray([1, 2, 3, 4, 5], dtype=int8) Base Index: 1 {'key_0': FastArray([0, 1, 2, 3, 4]), 'key_1': FastArray([0, 1, 2, 3, 4])} Unique count: 5
>>> c._categories_wrap.get_multikey_index((0,0)) 1
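The lookup described above amounts to finding the row of the unique key columns that matches the tuple. A minimal linear-search sketch, with `find_multikey` as a hypothetical helper (the library is free to use a faster search):

```python
def find_multikey(key_columns, multikey, base_index=1):
    """Return the matching row position plus base_index, or -1 if the tuple is not found."""
    target = tuple(multikey)
    for row, candidate in enumerate(zip(*key_columns)):
        if candidate == target:
            return row + base_index
    return -1


key_columns = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]   # 'key_0' and 'key_1'
print(find_multikey(key_columns, (0, 0)))   # 1
print(find_multikey(key_columns, (0, 1)))   # -1
```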
- riptable.rt_categorical.CatZero(values, categories=None, ordered=None, sort_gb=None, lex=None, base_index=0, **kwargs)[source]
Calls Categorical() with base_index keyword set to 0.
- riptable.rt_categorical.categorical_convert(v, base_index=0)[source]
- Parameters:
v (a pandas categorical) –
- Returns:
The two building blocks needed to make a Riptable Categorical: the integer array and the array that it indexes into.
Whatever the underlying object of the pandas categorical is, it is converted to strings to detach it from object references and from pandas.
pandas also uses -1 to indicate an out-of-bounds value. When this is detected, an item is inserted at the beginning.
Examples
>>> p=pd.Categorical(['a','b','b','a','a','c','b','c','a','a'], categories=['a','b']) >>> test=Categorical(p)
from a cut
>>> a=rt.FA(rt.arange(10.0)+.1) >>> p=pd.cut(a,[0,3,6,7]) (0, 3], (0, 3], (3, 6], (3, 6], (3, 6], (6, 7], NaN, NaN, NaN] >>> test=Categorical(p) Categorical([(0, 3], (0, 3], (0, 3], (3, 6], (3, 6], (3, 6], (6, 7], nan, nan, nan])
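The handling of the pandas -1 code can be sketched without pandas, using plain lists in place of `p.codes` and `p.categories`. `split_pandas_codes` and the `"nan"` label for the inserted item are assumptions made for illustration.

```python
def split_pandas_codes(pandas_codes, pandas_categories, missing_label="nan"):
    """Return (codes, categories) with categories as strings and room made for -1 codes."""
    categories = [str(cat) for cat in pandas_categories]
    if any(code == -1 for code in pandas_codes):
        # Insert an item at the beginning and shift every code by one, so -1 becomes 0.
        return [code + 1 for code in pandas_codes], [missing_label] + categories
    return list(pandas_codes), categories


# Codes pandas would hold for ['a','b','b','a','a','c'] with categories=['a','b'].
codes, cats = split_pandas_codes([0, 1, 1, 0, 0, -1], ['a', 'b'])
print(codes)   # [1, 2, 2, 1, 1, 0]
print(cats)    # ['nan', 'a', 'b']
```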
- riptable.rt_categorical.categorical_merge_dict(list_categories, return_is_safe=False, return_type=Categorical)[source]
Checks to make sure all unique string values in all dictionaries have the same corresponding integer in every categorical they appear in. Checks to make sure all unique integer values in all dictionaries have the same corresponding string in every categorical they appear in.
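The two checks can be expressed compactly in pure Python. This sketch only tests consistency and returns a boolean; it does not merge anything, and `mappings_are_consistent` is a hypothetical helper rather than the function documented here.

```python
def mappings_are_consistent(dicts):
    """True if every string has one integer and every integer has one string across all dicts."""
    str_to_int, int_to_str = {}, {}
    for mapping in dicts:
        for name, code in mapping.items():
            # setdefault records the first pairing seen and returns it on later lookups.
            if str_to_int.setdefault(name, code) != code:
                return False
            if int_to_str.setdefault(code, name) != name:
                return False
    return True


print(mappings_are_consistent([{'a': 1, 'b': 2}, {'b': 2, 'c': 3}]))   # True
print(mappings_are_consistent([{'a': 1}, {'a': 2}]))                   # False
```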