riptable.rt_categorical

Classes

Categorical

A Categorical efficiently stores an array of repeated strings and is used for groupby operations.

Categories

Holds categories for each Categorical instance. This adds a layer of abstraction to Categorical.

Functions

CatZero(values[, categories, ordered, sort_gb, lex, ...])

Calls Categorical() with base_index keyword set to 0.

categorical_convert(v[, base_index])

param v:

categorical_merge_dict(list_categories[, ...])

Checks to make sure all unique string values in all dictionaries have the same corresponding integer in every categorical they appear in.

class riptable.rt_categorical.Categorical(values, categories=None, ordered=None, sort_gb=None, sort_display=None, lex=None, base_index=None, filter=None, dtype=None, unicode=None, invalid=None, auto_add=False, from_matlab=False, _from_categorical=None)[source]

Bases: riptable.rt_groupbyops.GroupByOps, riptable.rt_fastarray.FastArray

A Categorical efficiently stores an array of repeated strings and is used for groupby operations.

Riptable Categorical objects have two related uses:

  • They efficiently store string (or other large dtype) arrays that have repeated values. The repeated values are partitioned into groups (a.k.a. categories), and each group is mapped to an integer. The mapping codes allow the data to be stored and operated on more efficiently.

  • They’re Riptable’s class for doing groupby operations. A method applied to a Categorical is applied to each group separately.

A Categorical is typically created from a list of strings:

>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> c
Categorical([b, a, b, a, c, c, b]) Length: 7
  FastArray([2, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

The output shows:

  • The Categorical values. These are grouped into unique categories (here, “a”, “b”, and “c”), which are also stored in the Categorical (see below).

  • The integer mapping codes (also called bins). Each integer is mapped to a unique category (here, 1 is mapped to “a”, 2 is mapped to “b”, and 3 is mapped to “c”). Because these codes can also be used to index into the Categorical, they’re also referred to as indices. By default, the index is 1-based, with 0 reserved for Filtered values.

  • The unique categories. Each category represents a group for groupby operations.
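
The relationship between the values, the integer mapping codes, and the unique categories can be sketched in plain Python. This is only an illustration of the idea, not Riptable's implementation; the helper name `encode_base1` is hypothetical.

```python
def encode_base1(values):
    """Map each value to a 1-based integer code over the sorted unique categories."""
    categories = sorted(set(values))  # unique categories, sorted
    # Codes start at 1 so that 0 stays reserved for Filtered values.
    code_of = {cat: i + 1 for i, cat in enumerate(categories)}
    codes = [code_of[v] for v in values]
    return codes, categories


codes, categories = encode_base1(["b", "a", "b", "a", "c", "c", "b"])
print(codes)       # [2, 1, 2, 1, 3, 3, 2]
print(categories)  # ['a', 'b', 'c']
```

Only the small integer codes repeat; each category string is stored once.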

Use Categorical objects to perform aggregations over arbitrary arrays of the same dimension as the Categorical:

>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> ints = rt.FA([3, 10, 2, 5, 4, 1, 1])
>>> flts = rt.FA([1.2, 3.4, 5.6, 4.0, 2.1, 0.6, 11.3])
>>> c.sum([ints, flts])
*key_0   col_0   col_1
------   -----   -----
a           15    7.40
b            6   18.10
c            5    2.70

[3 rows x 3 columns] total bytes: 51.0 B
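
As a rough mental model of what a grouped aggregation computes, the following plain-Python sketch sums one data array per key. The function `group_sum` is a hypothetical helper for illustration and says nothing about how Riptable performs the operation internally.

```python
from collections import defaultdict


def group_sum(keys, data):
    """Sum the entries of `data` that share the same key."""
    totals = defaultdict(int)
    for key, value in zip(keys, data):
        totals[key] += value
    return dict(sorted(totals.items()))  # sorted by key, like the table above


keys = ["b", "a", "b", "a", "c", "c", "b"]
ints = [3, 10, 2, 5, 4, 1, 1]
print(group_sum(keys, ints))  # {'a': 15, 'b': 6, 'c': 5}
```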

Multi-Key Categoricals

The Categorical above is a single-key Categorical – it groups one array of values into keys (the categories) for groupby operations.

Multi-key Categorical objects let you create and operate on groupings based on multiple associated categories. The associated keys form a group:

>>> strs = rt.FastArray(["a", "b", "b", "a", "b", "a"])
>>> ints = rt.FastArray([2, 1, 1, 2, 1, 1])
>>> c = rt.Categorical([strs, ints])  # Create with a list of arrays.
>>> c
Categorical([(a, 2), (b, 1), (b, 1), (a, 2), (b, 1), (a, 1)]) Length: 6
  FastArray([1, 2, 2, 1, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'a', b'b', b'a'], dtype='|S1'), 'key_1': FastArray([2, 1, 1])} Unique count: 3
>>> c.count()
*key_0   *key_1   Count
------   ------   -----
a             2       2
b             1       3
a             1       1

[3 rows x 3 columns] total bytes: 27.0 B
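
A multi-key grouping can be pictured as assigning one code to each distinct combination of keys. The sketch below is plain Python with a hypothetical helper name, and it assumes codes are assigned in order of first appearance, which is what the example output above suggests.

```python
def encode_multikey(*key_arrays):
    """Assign one 1-based code to each distinct key combination, by first appearance."""
    code_of = {}
    codes = []
    for combo in zip(*key_arrays):
        if combo not in code_of:
            code_of[combo] = len(code_of) + 1
        codes.append(code_of[combo])
    return codes, list(code_of)


strs = ["a", "b", "b", "a", "b", "a"]
ints = [2, 1, 1, 2, 1, 1]
codes, groups = encode_multikey(strs, ints)
print(codes)   # [1, 2, 2, 1, 2, 3]
print(groups)  # [('a', 2), ('b', 1), ('a', 1)]
```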

Filtered Values and Categories

Filter values and categories to exclude them from operations on the Categorical.

Categorical objects can be filtered when they’re created or anytime afterwards. Because filtered items are mapped to 0 in the integer mapping array, filters can be used only in base-1 Categorical objects.

Filters can also be applied on a one-off basis at the time of an operation. See the Filtering topic under More About Categoricals for examples.
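
The effect of a filter on a base-1 mapping can be sketched as follows. This is a plain-Python illustration under the assumptions stated in the comments; `apply_filter` and `count_per_code` are hypothetical helpers, not Riptable functions.

```python
FILTERED = 0  # code reserved for Filtered values in a base-1 mapping


def apply_filter(codes, keep):
    """Set the code to 0 wherever `keep` is False."""
    return [code if k else FILTERED for code, k in zip(codes, keep)]


def count_per_code(codes):
    """Count values per code, skipping Filtered (0) entries."""
    counts = {}
    for code in codes:
        if code != FILTERED:
            counts[code] = counts.get(code, 0) + 1
    return counts


codes = [2, 1, 2, 1, 3, 3, 2]
keep = [True, True, False, True, True, False, True]
filtered = apply_filter(codes, keep)
print(filtered)                  # [2, 1, 0, 1, 3, 0, 2]
print(count_per_code(filtered))  # {2: 2, 1: 2, 3: 1}
```

Because 0 has to stay free for Filtered entries, the same trick is not available when 0 is itself a valid category code.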

More About Categoricals

For more about using Categorical objects, see the Categoricals section of the Intro to Riptable or the more in-depth topics under More About Categoricals.

Parameters:
  • values (array of str, int, or float, list of arrays, dict, or Categorical or pandas.Categorical) –

    • Strings: Unicode strings and byte strings are supported.

    • Integers without provided categories: The integer mapping codes start at 1.

    • Integers with provided categories: If you have an array of integers that indexes into an array of provided unique categories, the integers are used for the integer mapping array. Any 0 values are mapped to the Filtered category.

    • Floats are supported with no user-provided categories. If you have a Matlab Categorical with categories, set from_matlab to True. Categorical objects created from Matlab Categoricals must have a base-1 index; any 0.0 values become Filtered.

    • A list of arrays or a dictionary with multiple key-value pairs creates a multi-key Categorical.

    • For a Categorical created from a Categorical, a deep copy of categories is performed.

    • For a Categorical created from a Pandas Categorical, a deep copy is performed and indices start at 1 to preserve invalid values. Categorical objects created from Pandas Categoricals must have a base-1 index.

  • categories (array of str, int, or float, dict of {str : int} or {int : str}, or IntEnum, optional) –

    The unique categories. Can be:

    • An array of strings, integers, or floats. Floats can be used only when values is numeric. Warning: Non-unique categories may give unexpected results in operations.

    • A dictionary or IntEnum that maps integers to strings or strings to integers. Provided values must be integers.

    Note:

    • User-provided categories are always held in the order provided.

    • Multi-key Categorical objects don’t support user-provided categories.

  • ordered (bool, default None/True) –

    Controls whether categories are sorted lexicographically before they are mapped to integers:

    • If categories are not provided, by default they are sorted. If ordered=False, the order is first appearance unless lex=True. To sort categories for groupby operations, use sort_gb=True (see below).

    • If categories are provided, they are always held in the order they’re provided in; they can’t be sorted with ordered or lex.

  • sort_gb (bool, default None/False) – Controls whether groupby operation results are displayed in sorted order. Note that results may already appear sorted based on ordered or lex settings.

  • sort_display (bool, optional) – See sort_gb.

  • lex (bool, default None/False) – Controls whether hashing- or sorting-based logic is used to find unique values in the input array. By default hashing is used. If more than 50% of the values are unique, set lex=True for a possibly faster lexicographical sort (not supported if categories are provided).

  • base_index ({None, 0, 1}, default None/1) –

    By default, base-1 indexing is used. Base-0 can be used if:

    • A mapping dictionary isn’t used. A Categorical created from a mapping dictionary does not have a base index.

    • A filter isn’t used at creation.

    • A Matlab or Pandas Categorical isn’t being converted. These both reserve 0 for invalid values.

    If base-0 indexing is used, 0 becomes a valid category.

  • filter (array of bool, optional) – Must be the same length as values. Values that are False become Filtered and mapped to 0 in the integer mapping array, and they are ignored in groupby operations. A filter can’t be used with a base-0 Categorical or one created with a mapping dictionary or IntEnum.

  • dtype (riptable.dtype, numpy.dtype, or str, optional) – Force the dtype of the underlying integer mapping array. Must be a signed integer dtype. By default, the constructor uses the smallest dtype based on the number of unique categories or the maximum value provided in a mapping.

  • unicode (bool, default False) – By default, the array of unique categories is stored as byte strings. Set to True to store as unicode strings.

  • invalid (str, optional) – Specify a value in values to be treated as an invalid category. Note: Invalid categories are not excluded from aggregations; use filter instead. Warning: If the invalid category isn’t included in categories and a filter is used, the invalid category becomes Filtered.

  • auto_add (bool, default False) – Warning: Until a known issue is fixed, adding categories can have unexpected results. Intended behavior: When set to True, categories that do not exist in the unique categories can be added using category_add.

  • from_matlab (bool, default False) – Set to True to convert a Matlab Categorical. The float indices are converted to an integer type. To preserve invalid values, only base-1 indexing is supported.
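
To make the interaction of ordered and base_index concrete, here is a minimal plain-Python sketch. The function `encode` and its signature are hypothetical and only mirror the parameter descriptions above; it does not model lex, filter, or any other constructor option.

```python
def encode(values, ordered=True, base_index=1):
    """Hypothetical helper: code values over sorted or first-appearance categories."""
    if ordered:
        categories = sorted(set(values))
    else:
        categories = list(dict.fromkeys(values))  # unique, in order of first appearance
    code_of = {cat: i + base_index for i, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


vals = ["b", "a", "b", "c"]
print(encode(vals))                 # ([2, 1, 2, 3], ['a', 'b', 'c'])
print(encode(vals, ordered=False))  # ([1, 2, 1, 3], ['b', 'a', 'c'])
print(encode(vals, base_index=0))   # ([1, 0, 1, 2], ['a', 'b', 'c'])
```

With base_index=0, the code 0 refers to a real category, which is why it can no longer stand for Filtered.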

See also

Accum2

Class for multi-key aggregations with summary data displayed.

Categorical._fa

Return the array of integer category mapping codes that corresponds to the array of Categorical values.

Categorical.category_array

Return the array of unique categories of a Categorical.

Categorical.category_dict

Return a dictionary of the unique categories.

Categorical.category_mapping

Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.

Categorical.base_index

See the base index of a Categorical.

Categorical.isnan

See which Categorical category is invalid.

Examples

A single-key Categorical created from a list of strings:

>>> c = rt.Categorical(["b", "a", "b", "a", "c", "c", "b"])
>>> c
Categorical([b, a, b, a, c, c, b]) Length: 7
  FastArray([2, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

A Categorical created from a list of non-unique string values and a list of unique category strings. All values must appear in the provided categories; otherwise, an error is raised:

>>> rt.Categorical(["b", "a", "b", "c", "a", "c", "c", "c"], categories=["b", "a", "c"])
Categorical([b, a, b, c, a, c, c, c]) Length: 8
  FastArray([1, 2, 1, 3, 2, 3, 3, 3], dtype=int8) Base Index: 1
  FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3

A Categorical created from a list of integers that index into a list of unique strings. The integers are used for the mapping array. Note that 0 becomes Filtered:

>>> rt.Categorical([0, 1, 1, 0, 2, 1, 2], categories=["c", "a", "b"])
Categorical([Filtered, c, c, Filtered, a, c, a]) Length: 7
  FastArray([0, 1, 1, 0, 2, 1, 2]) Base Index: 1
  FastArray([b'c', b'a', b'b'], dtype='|S1') Unique count: 3

If integers are provided with no categories and 0 is included, the integer mapping codes are incremented by 1 so that 0 is not Filtered:

>>> rt.Categorical([0, 1, 1, 0, 2, 1, 2])
Categorical([0, 1, 1, 0, 2, 1, 2]) Length: 7
  FastArray([1, 2, 2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([0, 1, 2]) Unique count: 3

Use from_matlab=True to create a Categorical from Matlab data. The float indices are converted to an integer type. To preserve invalid values, only base-1 indexing is supported:

>>> rt.Categorical([0.0, 1.0, 2.0, 3.0, 1.0, 1.0], categories=["b", "c", "a"], from_matlab=True)
Categorical([Filtered, b, c, a, b, b]) Length: 6
  FastArray([0, 1, 2, 3, 1, 1], dtype=int8) Base Index: 1
  FastArray([b'b', b'c', b'a'], dtype='|S1') Unique count: 3

A Categorical created from a Pandas Categorical with an invalid value:

>>> import pandas as pd
>>> pdc = pd.Categorical(["a", "a", "z", "b", "c"], ["c", "b", "a"])
>>> pdc
['a', 'a', NaN, 'b', 'c']
Categories (3, object): ['c', 'b', 'a']
>>> rt.Categorical(pdc)
Categorical([a, a, Filtered, b, c]) Length: 5
  FastArray([3, 3, 0, 2, 1], dtype=int8) Base Index: 1
  FastArray([b'c', b'b', b'a'], dtype='|S1') Unique count: 3
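
The codes in this output are consistent with a simple shift. Pandas stores 0-based codes and uses -1 for values that are not in the categories, so adding 1 to every code yields a base-1 mapping in which the missing value lands on 0. The sketch below shows only that arithmetic in plain Python; whether Riptable performs the conversion exactly this way is an assumption, and `shift_pandas_codes` is a hypothetical name.

```python
def shift_pandas_codes(pandas_codes):
    """Convert 0-based codes that use -1 for missing into base-1 codes that use 0."""
    return [code + 1 for code in pandas_codes]


# Codes pandas holds for ["a", "a", "z", "b", "c"] with categories ["c", "b", "a"]:
pandas_codes = [2, 2, -1, 1, 0]
print(shift_pandas_codes(pandas_codes))  # [3, 3, 0, 2, 1]
```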

A Categorical created from a Python dictionary of strings to integers. The dictionary is provided as the categories argument, with a list of the mapping codes provided as the first argument:

>>> d = {"StronglyAgree": 44, "Agree": 133, "Disagree": 75, "StronglyDisagree": 1, "NeitherAgreeNorDisagree": 144 }
>>> codes = [1, 44, 44, 133, 75]
>>> rt.Categorical(codes, categories=d)
Categorical([StronglyDisagree, StronglyAgree, StronglyAgree, Agree, Disagree]) Length: 5
  FastArray([  1,  44,  44, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 4

A Categorical created using the categories of another Categorical:

>>> c = rt.Categorical(["a", "a", "b", "a", "c", "c", "b"], categories=["c", "b", "a"])
>>> c.category_array
FastArray([b'c', b'b', b'a'], dtype='|S1')
>>> c2 = rt.Categorical(["b", "c", "c", "b"], categories=c.category_array)
>>> c2
Categorical([b, c, c, b]) Length: 4
  FastArray([2, 1, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'c', b'b', b'a'], dtype='|S1') Unique count: 3

Multi-key Categoricals let you create and operate on groupings based on multiple associated categories:

>>> strs = rt.FastArray(["a", "b", "b", "a", "b", "a"])
>>> ints = rt.FastArray([2, 1, 1, 2, 1, 3])
>>> c = rt.Categorical([strs, ints]) # Create with a list of arrays.
>>> c
Categorical([(a, 2), (b, 1), (b, 1), (a, 2), (b, 1), (a, 3)]) Length: 6
  FastArray([1, 2, 2, 1, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'a', b'b', b'a'], dtype='|S1'), 'key_1': FastArray([2, 1, 3])} Unique count: 3
>>> c.count()
*key_0   *key_1   Count
------   ------   -----
a             2       2
b             1       3
a             3       1

[3 rows x 3 columns] total bytes: 27.0 B
property _categories
property _fa: riptable.rt_fastarray.FastArray

Return the array of integer category mapping codes that corresponds to the array of Categorical values.

Returns:

A FastArray of the integer category mapping codes of the Categorical.

Return type:

FastArray

See also

Categorical.category_array

Return the array of unique categories of a Categorical.

Categorical.categories

Return the unique categories of a single-key or multi-key Categorical, prepended with the ‘Filtered’ category.

Categorical.category_dict

Return a dictionary of the unique categories.

Categorical.category_mapping

Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.

Examples

Single-key string Categorical:

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c
Categorical([a, a, b, c, a]) Length: 5
  FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c._fa
FastArray([1, 1, 2, 3, 1], dtype=int8)

Multi-key Categorical:

>>> c2 = rt.Categorical([rt.FA([1, 2, 3, 3, 3, 1]), rt.FA(['a','b','c','c','c','a'])])
>>> c2
Categorical([(1, a), (2, b), (3, c), (3, c), (3, c), (1, a)]) Length: 6
  FastArray([1, 2, 3, 3, 3, 1], dtype=int8) Base Index: 1
  {'key_0': FastArray([1, 2, 3]), 'key_1': FastArray([b'a', b'b', b'c'], dtype='|S1')} Unique count: 3
>>> c2._fa
FastArray([1, 2, 3, 3, 3, 1], dtype=int8)

A Categorical constructed with an IntEnum or a mapping dictionary returns the provided integer category mapping codes:

>>> log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"}
>>> c3 = rt.Categorical([10, 10, 40, 0, 50, 10, 30], log_levels)
>>> c3
Categorical([DEBUG, DEBUG, ERROR, !<0>, CRITICAL, DEBUG, WARNING]) Length: 7
  FastArray([10, 10, 40,  0, 50, 10, 30]) Base Index: None
  {10:'DEBUG', 20:'INFO', 30:'WARNING', 40:'ERROR', 50:'CRITICAL'} Unique count: 5
>>> c3._fa
FastArray([10, 10, 40,  0, 50, 10, 30])
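
With a mapping dictionary, the stored codes are the dictionary's own integers, and the display is a lookup. A plain-Python sketch of that lookup follows; the `!<code>` placeholder for an unmapped code is inferred from the repr above, and `display_labels` is a hypothetical helper.

```python
def display_labels(codes, mapping):
    """Look up each code in an int -> str mapping; show unmapped codes as '!<code>'."""
    return [mapping.get(code, f"!<{code}>") for code in codes]


log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"}
labels = display_labels([10, 10, 40, 0, 50, 10, 30], log_levels)
print(labels)
# ['DEBUG', 'DEBUG', 'ERROR', '!<0>', 'CRITICAL', 'DEBUG', 'WARNING']
```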

A ‘Filtered’ category is mapped to 0 in the integer array:

>>> c4 = rt.Categorical(['b','b','c','d','e','b','c'])
>>> c4
Categorical([b, b, c, d, e, b, c]) Length: 7
  FastArray([1, 1, 2, 3, 4, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'b', b'c', b'd', b'e'], dtype='|S1') Unique count: 4
>>> c4._fa
FastArray([1, 1, 2, 3, 4, 1, 2], dtype=int8)
>>> c4.category_remove('c')  # A removed category becomes 'Filtered'.
>>> c4
Categorical([b, b, Filtered, d, e, b, Filtered]) Length: 7
  FastArray([1, 1, 0, 2, 3, 1, 0], dtype=int8) Base Index: 1
  FastArray([b'b', b'd', b'e'], dtype='|S1') Unique count: 3
>>> c4._fa
FastArray([1, 1, 0, 2, 3, 1, 0], dtype=int8)
property _total_size: int

Returns total size in bytes of Categorical’s Index FastArray and category array(s).

property as_string_array: riptable.rt_fastarray.FastArray

Return the full list of values of a Categorical as a string array.

For multi-key Categorical objects, the corresponding keys are concatenated with a “_” separator.

Filtered values become the string “Filtered”. Values from invalid categories are treated the same way as values from valid categories.

NOTE: This routine is costly because it re-expands the full list of values as strings.

Returns:

A FastArray of the string values of the Categorical.

Return type:

rt_fastarray.FastArray

See also

Categorical.expand_array

Return the full list of Categorical values.

Notes

This method works by applying an index mask to the unique categories.
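
One way to picture the index mask is as a lookup table with the Filtered name in position 0, indexed by the base-1 codes. The following plain-Python sketch makes that assumption explicit; `expand_as_strings` is a hypothetical helper and not part of Riptable.

```python
def expand_as_strings(codes, categories, filtered_name="Filtered"):
    """Rebuild the full list of string values from base-1 codes and unique categories."""
    lookup = [filtered_name] + [str(cat) for cat in categories]  # position 0 is Filtered
    return [lookup[code] for code in codes]


expanded = expand_as_strings([1, 2, 1, 3, 0], ["AAPL", "MSFT", "TSLA"])
print(expanded)  # ['AAPL', 'MSFT', 'AAPL', 'TSLA', 'Filtered']
```

The result has one string per original value, which is why the expansion is costly for long arrays.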

Examples

Single-key string Categorical:

>>> c = rt.Categorical(["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "TSLA", "AAPL"])
>>> c
Categorical([AAPL, MSFT, AAPL, TSLA, MSFT, TSLA, AAPL]) Length: 7
  FastArray([1, 2, 1, 3, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'AAPL', b'MSFT', b'TSLA'], dtype='|S4') Unique count: 3
>>> c.as_string_array
FastArray([b'AAPL', b'MSFT', b'AAPL', b'TSLA', b'MSFT', b'TSLA', b'AAPL'], dtype='|S8')

Single-key integer Categorical:

>>> c = rt.Categorical([1, 2, 1, 1, 3, 2, 3])
>>> c.as_string_array
FastArray([b'1', b'2', b'1', b'1', b'3', b'2', b'3'], dtype='|S21')

Multi-key Categorical:

>>> key1 = rt.FastArray(["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "TSLA", "AAPL"])
>>> key2 = rt.FastArray([1, 1, 2, 2, 3, 3, 4])
>>> mk_cat = rt.Categorical([key1, key2])
>>> mk_cat
Categorical([(AAPL, 1), (MSFT, 1), (AAPL, 2), (TSLA, 2), (MSFT, 3), (TSLA, 3), (AAPL, 4)]) Length: 7
  FastArray([1, 2, 3, 4, 5, 6, 7], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'AAPL', b'MSFT', b'AAPL', b'TSLA', b'MSFT', b'TSLA', b'AAPL'], dtype='|S4'), 'key_1': FastArray([1, 1, 2, 2, 3, 3, 4])} Unique count: 7
>>> mk_cat.as_string_array
FastArray([b'AAPL_1', b'MSFT_1', b'AAPL_2', b'TSLA_2', b'MSFT_3',
           b'TSLA_3', b'AAPL_4'], dtype='|S26')
property base_index: enum.IntEnum
property category_array: riptable.rt_fastarray.FastArray

Return the array of unique categories of a Categorical.

Unlike Categorical.categories, this method does not prepend the ‘Filtered’ category to the returned array.

Raises an error for multi-key Categorical objects. To get the categories of a multi-key Categorical, use Categorical.categories.

Returns:

A FastArray of the unique categories of the Categorical.

Return type:

FastArray

See also

Categorical._fa

Return the array of integer category mapping codes that corresponds to the array of Categorical values.

Categorical.categories

Return the unique categories of a single-key or multi-key Categorical, prepended with the ‘Filtered’ category.

Categorical.category_dict

Return a dictionary of the unique categories.

Categorical.category_mapping

Return a dictionary of the integer category mapping codes for a Categorical created with an IntEnum or a mapping dictionary.

Examples

Single-key string Categorical:

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c
Categorical([a, a, b, c, a]) Length: 5
  FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.category_array
FastArray([b'a', b'b', b'c'], dtype='|S1')

Single-key integer Categorical:

>>> c2 = rt.Categorical([4, 5, 4, 4, 6, 5, 6])
>>> c2
Categorical([4, 5, 4, 4, 6, 5, 6]) Length: 7
  FastArray([1, 2, 1, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([4, 5, 6]) Unique count: 3
>>> c2.category_array
FastArray([4, 5, 6])

Single-key integer Categorical with categories provided:

>>> c3 = rt.Categorical([2, 3, 4, 2, 3, 4], categories=['a', 'b', 'c', 'd', 'e'])
>>> c3
Categorical([b, c, d, b, c, d]) Length: 6
  FastArray([2, 3, 4, 2, 3, 4]) Base Index: 1
  FastArray([b'a', b'b', b'c', b'd', b'e'], dtype='|S1') Unique count: 5
>>> c3.category_array
FastArray([b'a', b'b', b'c', b'd', b'e'], dtype='|S1')

The ‘Filtered’ category isn’t included:

>>> c4 = rt.Categorical([0, 1, 1, 0, 2, 1, 1, 1, 2, 0], categories=['a', 'b', 'c'])
>>> c4
Categorical([Filtered, a, a, Filtered, b, a, a, a, b, Filtered]) Length: 10
  FastArray([0, 1, 1, 0, 2, 1, 1, 1, 2, 0]) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c4.category_array
FastArray([b'a', b'b', b'c'], dtype='|S1')

A Categorical constructed with an IntEnum or a mapping dictionary returns the provided string categories:

>>> log_levels = {10: "DEBUG", 20: "INFO", 30: "WARNING", 40: "ERROR", 50: "CRITICAL"}
>>> c5 = rt.Categorical([10, 10, 40, 0, 50, 10, 30], log_levels)
>>> c5
Categorical([DEBUG, DEBUG, ERROR, !<0>, CRITICAL, DEBUG, WARNING]) Length: 7
  FastArray([10, 10, 40,  0, 50, 10, 30]) Base Index: None
  {10:'DEBUG', 20:'INFO', 30:'WARNING', 40:'ERROR', 50:'CRITICAL'} Unique count: 5
>>> c5.category_array
FastArray([b'DEBUG', b'INFO', b'WARNING', b'ERROR', b'CRITICAL'],
        dtype='|S8')
property category_codes: riptable.rt_fastarray.FastArray
property category_dict: Mapping[str, riptable.rt_fastarray.FastArray]

When possible, returns the dictionary of stored unique categories, otherwise raises an error.

Unlike the default for categories(), this will not prepend the invalid category to each array.

property category_mapping: dict
property category_mode: riptable.rt_enum.CategoryMode

Returns the category mode of the Categorical’s Categories object. List modes are when the categorical has gone through the unique/mbget process of binning. Dict modes are when the categorical was constructed with a dictionary mapping or IntEnum. Grouping mode is when the categorical was binned with the groupby hash (numeric list, multikey, etc.)

Returns:

see CategoryMode in rt_enum.py

Return type:

IntEnum

property expand_array: numpy.ndarray | Tuple[numpy.ndarray, Ellipsis]

Return the full list of values of a Categorical.

If the Categorical is constructed with an IntEnum or a mapping dictionary, the integer mapping codes are returned.

Filtered Categorical values are returned as “Filtered” for string arrays or numeric sentinel values for numeric arrays.

Note that because the expansion constructs the complete list of values from the list of unique categories, it is an expensive operation.

Returns:

For single-key Categorical objects, a FastArray is returned. For multi-key Categorical objects, a tuple of FastArray objects is returned.

Return type:

FastArray or tuple of FastArray

Warns:

Performance warning – Will warn the user if a large Categorical (more than 100,000 items) is being re-expanded.

See also

Categorical.as_string_array

Return the full list of values of a Categorical as a string array.

Examples

Single-key Categorical:

>>> c = rt.Categorical(["a", "a", "b", "c", "a"])
>>> c.expand_array
FastArray([b'a', b'a', b'b', b'c', b'a'], dtype='|S3')

Multi-key Categorical:

>>> c = rt.Categorical([rt.FastArray(["a", "b", "c", "a"]), rt.FastArray([1, 2, 3, 1])])
>>> c.expand_array
(FastArray([b'a', b'b', b'c', b'a'], dtype='|S8'), FastArray([1, 2, 3, 1]))

For a Categorical constructed with an IntEnum or a mapping dictionary, the array of integer mapping codes (c._fa) is returned:

>>> c = rt.Categorical([2, 2, 2, 1, 3], {"a": 1, "b": 2, "c": 3})
>>> c
Categorical([b, b, b, a, c]) Length: 5
  FastArray([2, 2, 2, 1, 3]) Base Index: None
  {1:'a', 2:'b', 3:'c'} Unique count: 3
>>> c.expand_array
FastArray([2, 2, 2, 1, 3])
>>> c._fa
FastArray([2, 2, 2, 1, 3])

Filtered string Categorical values are returned as the string “Filtered”:

>>> a = rt.FastArray(["a", "c", "b", "b", "c", "a"])
>>> f = rt.FastArray([False, False, True, True, True, True])
>>> c = rt.Categorical(a, filter=f)
>>> c
Categorical([Filtered, Filtered, b, b, c, a]) Length: 6
  FastArray([0, 0, 2, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.expand_array
FastArray([b'Filtered', b'Filtered', b'b', b'b', b'c', b'a'], dtype='|S8')

Filtered integer Categorical values are returned as the integer sentinel value:

>>> a = rt.FastArray([1, 3, 2, 2, 3, 1])
>>> f = rt.FastArray([False, False, True, True, True, True])
>>> c = rt.Categorical(a, filter=f)
>>> c
Categorical([Filtered, Filtered, 2, 2, 3, 1]) Length: 6
  FastArray([0, 0, 2, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([1, 2, 3]) Unique count: 3
>>> c.expand_array
FastArray([-2147483648, -2147483648,           2,           2,
             3,           1])
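
The value -2147483648 shown above equals the minimum of a signed 32-bit integer. A plain-Python sketch of the numeric expansion, assuming a 32-bit integer category array and using the hypothetical helper `expand_ints`:

```python
INT32_SENTINEL = -(2 ** 31)  # -2147483648, the value shown for Filtered entries above


def expand_ints(codes, categories, sentinel=INT32_SENTINEL):
    """Rebuild integer values from base-1 codes; code 0 (Filtered) becomes the sentinel."""
    return [sentinel if code == 0 else categories[code - 1] for code in codes]


expanded = expand_ints([0, 0, 2, 2, 3, 1], [1, 2, 3])
print(expanded)  # [-2147483648, -2147483648, 2, 2, 3, 1]
```
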
property expand_dict: Dict[str, riptable.rt_fastarray.FastArray]

Returns:

A dictionary of expanded single-key or multi-key columns.

Return type:

dict

Notes

Warns the user if a large Categorical (more than 100,000 items) is being re-expanded.

Examples

>>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.arange(5)])
>>> c.expand_dict
{'key_0': FastArray([b'a', b'a', b'b', b'c', b'a'], dtype='|S3'),
 'key_1': FastArray([0, 1, 2, 3, 4])}
property filtered_name: str

Item displayed when a 0 bin is encountered. Will be omitted from groupby results by default.

property filtered_string
property gb_keychain
property groupby_data

All GroupByOps objects can hold a default dataset to perform operations on. GroupBy always holds a dataset. Categorical and Accum2 do not.

Examples

By default, requires data to be passed:

>>> c = rt.Categorical(['a','b','c'])
>>> c.sum()
ValueError: Useable data has not been specified in (). Pass in array data to operate on.

A Categorical returned by Dataset.cat() has its groupby data set:

>>> ds = rt.Dataset({'groups':np.random.choice(['a','b','c'],10), 'data': rt.arange(10), 'data2': rt.arange(10)})
>>> ds
#   groups   data   data2
-   ------   ----   -----
0   a           0       0
1   a           1       1
2   c           2       2
3   c           3       3
4   a           4       4
5   a           5       5
6   c           6       6
7   b           7       7
8   c           8       8
9   a           9       9
>>> c = ds.cat('groups')
>>> c.sum()
*groups   data   data2
-------   ----   -----
a           19      19
b            7       7
c           19      19
property grouping

Grouping object that is called to perform calculations on grouped data. In the constructor, a grouping object provides a categorical with its instance array. The grouping object stores and generates other groupby information, like grouping indices, first occurrence, count, etc. The grouping object should be queried for all grouping-related properties. This is also a property in GroupBy, and is called by many routines in the GroupByOps parent class.

See Also: Grouping

property grouping_dict

Grouping dict held by Grouping object. May trigger lazy build of Grouping object.

property ifirstkey

Index of first occurrence of each unique key. May also trigger lazy evaluation of grouping object. If grouping object used the Groupby hash, it will have an iFirstKey array, otherwise returns None.

property ikey

Returns the grouping object’s iKey. This is always a base-1 index, and is often the same array as the Categorical. See also: grouping.ikey (may return a base-0 index)

property ilastkey

Index of last occurrence of each unique key. May also trigger lazy evaluation of grouping object. If grouping object used the Groupby hash, it will have an iLastKey array, otherwise returns None.
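
What "first occurrence" and "last occurrence" mean for each key can be sketched in plain Python. The function below is a hypothetical illustration that returns dictionaries keyed by code; the actual properties return arrays (or None), and their layout is not modeled here.

```python
def first_last_occurrence(codes):
    """Return {code: index} dictionaries for the first and last occurrence of each code."""
    first, last = {}, {}
    for i, code in enumerate(codes):
        first.setdefault(code, i)  # keep only the earliest index
        last[code] = i             # overwrite so the latest index wins
    return first, last


first, last = first_last_occurrence([1, 2, 2, 1, 2, 3])
print(first)  # {1: 0, 2: 1, 3: 5}
print(last)   # {1: 3, 2: 4, 3: 5}
```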

property invalid_category

The Categorical object’s invalid category.

An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.

Returns:

The invalid category of the Categorical. Returns None if there’s no invalid category.

Return type:

str or int or float or None

See also

Categorical.filtered_name

Item displayed when a 0 bin is encountered in a Categorical.

Categorical.isnan

Find the invalid elements of a Categorical.

Categorical.isnotnan

Find the valid elements of a Categorical.

Examples

>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b")
>>> c
Categorical([b, a, c, b, c]) Length: 5
  FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.invalid_category
'b'
>>> c.isnan()  # Returns True for invalid category.
FastArray([ True, False, False,  True, False])

Invalid categories are different from Filtered categories:

>>> f = rt.FA([False, True, True, False, True])
>>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="a", filter=f)
>>> c2
Categorical([Filtered, a, c, Filtered, c]) Length: 5
  FastArray([0, 1, 2, 0, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2
>>> c2.invalid_category
'a'
>>> c2.isnan()  # Show which values are in the invalid category.
FastArray([False,  True, False, False, False])
>>> c2.isfiltered()  # Show which values are Filtered.
FastArray([ True, False, False,  True, False])

Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is False for Cat.isnan():

>>> a = rt.FA([1, 2, 3, 4])
>>> a[3] = a.inv  # Set the last value to an integer NaN.
>>> a
FastArray([          1,           2,           3, -2147483648])
>>> c3 = rt.Categorical(values=a, invalid=2)  # Make 2 an invalid category.
>>> c3
Categorical([1, 2, 3, -2147483648]) Length: 4
  FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1
  FastArray([-2147483648,           1,           2,           3]) Unique count: 4
>>> c3.invalid_category
2
>>> c3.isnan()  # Only the invalid category returns True for Cat.isnan.
FastArray([False,  True, False, False])
>>> c3.expand_array.isnan()  # Only the integer NaN returns True for FA.isnan.
FastArray([False, False, False,  True])
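
The difference between an invalid category and a Filtered value comes down to which code is tested. The plain-Python sketch below reproduces the c2 results above under that reading; `is_invalid` and `is_filtered` are hypothetical helpers, not the Riptable methods.

```python
def is_invalid(codes, categories, invalid_category):
    """True where the value belongs to the invalid category (base-1 codes)."""
    invalid_code = categories.index(invalid_category) + 1
    return [code == invalid_code for code in codes]


def is_filtered(codes):
    """True where the value is Filtered (code 0)."""
    return [code == 0 for code in codes]


codes, categories = [0, 1, 2, 0, 2], ["a", "c"]
print(is_invalid(codes, categories, "a"))  # [False, True, False, False, False]
print(is_filtered(codes))                  # [True, False, False, True, False]
```
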
property isenum: bool

See Categories.enum

property ismultikey: bool

See Categories.multikey

property issinglekey: bool

See Categories.singlekey

property nan_index: int
property ordered: bool

If the categorical is tagged as ordered, the unique categories will remain in the order they were provided in.

ordered is also true if a sort was performed when generating the unique categories.

property sort_gb: bool
property sorted: bool

If the categorical is tagged as sorted, it can use a binary search when performing a lookup in the unique categories.

If a sorted groupby operation is performed, no sort will need to be applied.
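
Why sorted categories permit a binary search can be shown with the standard library's bisect module. This is a sketch only; `find_code` is a hypothetical helper, and returning 0 for a value that is not a category is an assumption made for the example.

```python
from bisect import bisect_left


def find_code(sorted_categories, value):
    """Return the base-1 code of `value`, or 0 if it is not a category."""
    pos = bisect_left(sorted_categories, value)
    if pos < len(sorted_categories) and sorted_categories[pos] == value:
        return pos + 1
    return 0


cats = ["a", "b", "c"]
print(find_code(cats, "b"))  # 2
print(find_code(cats, "z"))  # 0
```

The lookup takes a logarithmic number of comparisons, but it is only correct when the category list really is sorted.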

property transform

TO BE DEPRECATED

Examples

>>> c = rt.Categorical(ds.symbol)
>>> c.transform.sum(ds.TradeSize)
property unique_count

Number of unique values in the categorical. It is necessary for every groupby operation.

Notes

For categoricals in dict / enum mode that have generated their grouping object, this will reflect the number of unique values that occur in the non-unique values. Empty bins will not be included in the count.

property unique_repr
DebugMode = False
GroupingDebugMode = False
MetaDefault
MetaVersion = 1
TestIsMemberVerbose = False
_test_cat_ismember = ''
__arrow_array__(type=None)[source]

Implementation of the __arrow_array__ protocol for conversion to a pyarrow array.

Parameters:

type (pyarrow.DataType, optional, defaults to None) –

Return type:

pyarrow.Array or pyarrow.ChunkedArray

Notes

https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol

__del__()[source]

Called when a Categorical is deleted.

__eq__(other)[source]

Return self==value.

__ge__(other)[source]

Return self>=value.

__getitem__(fld)[source]

Indexing: Bracket indexing for Categoricals will always hit the FastArray of indices/codes first. If indexed by integer, the retrieved index or code will be passed to the Categories object so the corresponding Category can be returned. Otherwise, a new Categorical will be returned, using the same Categories as the original Categorical with a different index/code array.

The following examples will use this Categorical:

>>> c = rt.Categorical(['a','a','a','b','c','a','b'])
>>> c
Categorical([a, a, a, b, c, a, b]) Length: 7
  FastArray([1, 1, 1, 2, 3, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Single Integer:

For convenience, any bytestrings will be returned/displayed as unicode strings.

>>> c[3]
'b'

Multiple Integers:

>>> c[[1,2,3,4]]
Categorical([a, a, b, c]) Length: 4
  FastArray([1, 1, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c[np.arange(4,6)]
Categorical([c, a]) Length: 2
  FastArray([3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Boolean Array:

>>> mask = rt.FastArray([False,  True,  True,  True,  True,  True, False])
>>> c[mask]
Categorical([a, a, b, c, a]) Length: 5
  FastArray([1, 1, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Slice:

>>> c[2:5]
Categorical([a, b, c]) Length: 3
  FastArray([1, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
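The lookup order described for bracket indexing can be modeled in plain Python. This is only a sketch under stated assumptions: get_item is a hypothetical helper, the Categorical is reduced to a list of base-1 codes plus a list of categories, and returning the string "Filtered" for code 0 is an assumption made for the example.

```python
# Plain-Python model of a base-1, single-key Categorical (not riptable code).
categories = ["a", "b", "c"]
codes = [1, 1, 1, 2, 3, 1, 2]   # 0 is reserved for Filtered

def get_item(codes, categories, key):
    """Index the codes first, then decide what to return."""
    if isinstance(key, int):
        code = codes[key]
        # A single code is translated to its category.
        # Returning "Filtered" for code 0 is an assumption of this sketch.
        return "Filtered" if code == 0 else categories[code - 1]
    if isinstance(key, slice):
        new_codes = codes[key]
    elif len(key) > 0 and all(isinstance(k, bool) for k in key):
        new_codes = [c for c, keep in zip(codes, key) if keep]  # boolean mask
    else:
        new_codes = [codes[k] for k in key]                     # list of integers
    # Every non-scalar index yields new codes that share the same categories.
    return new_codes, categories

single = get_item(codes, categories, 3)
fancy, _ = get_item(codes, categories, [1, 2, 3, 4])
sliced, shared = get_item(codes, categories, slice(2, 5))
masked, _ = get_item(codes, categories, [False, True, True, True, True, True, False])
```

The values mirror the examples above: only the code array changes, while the categories object is reused.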
__gt__(other)[source]

Return self>value.

__le__(other)[source]

Return self<=value.

__lt__(other)[source]

Return self<value.

__ne__(other)[source]

Return self!=value.

__repr__(verbose=False)[source]

Return repr(self).

__setitem2__(key, value)[source]

Use grouping object isin, single item accessor instead of Categories object.

__setitem__(index, value)[source]
Parameters:
  • index (int or string (depends on category mode)) –

  • value (sequence or scalar value) – The value may represent a category or category index.

Raises:

IndexError

__str__()[source]

Return str(self).

static _array_compiled_numba_apply(iGroup, iFirstGroup, nCountGroup, userfunc, args)[source]
_as_meta_data(name=None)[source]
Parameters:

name (string, optional) – If not specified, will attempt to get name with get_name(), otherwise use class name.

Returns:

  • arrdict (dictionary) – Dictionary of column names -> arrays. Extra columns (for unique categories) will have the name+’!’ before their keys.

  • arrtypes (list) – List of SDSFlags, same length as arrdict.

  • meta (json-encoded string) – Meta data for the categorical.

See also

_from_meta_data

_attach_self_as_key_column(result)[source]
_autocomplete()[source]
_build_sds_meta_data(name, **kwargs)[source]

Generates meta data from calling categorical, assembles arrays to represent its unique categories.

Parameters:

name (name of the categorical in the calling structure, or Categorical by default) –

Returns:

  • meta (MetaData) – Metadata object for final save

  • cols (list of FastArray) – arrays to represent unique categories - regardless of CategoryMode

  • tups (list of tuples) – Names of the additional columns, in the format ‘name!col_’ followed by the column number. The enum for the second item in each tuple is still being determined (it will relate to multiday load/concatenation).

_build_string()[source]
_calculate_all(funcNum, *args, func_param=0, **kwargs)[source]
_categorical_compare_check(func_name, other)[source]

Converts a category to a valid index for faster logical comparison operations on the underlying index fastarray.

_category_make_unique_multi_key()[source]

Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.

_copy_extra(cat_copy)[source]

Internal routine to move over some extra data from self

_expand_array(arr, index=None)[source]

Internal routine to h-stack an invalid with an array for re-expanding single or multikey categoricals. This allows invalids to be retained in the re-expanded array(s)

static _from_arrow(arr, zero_copy_only=True, writable=False)[source]

Create a Categorical instance from a dictionary-encoded pyarrow.Array.

For certain special cases, namely CategoryMode.IntEnum, CategoryMode.Dictionary, and CategoryMode.MultiKey, this method accepts an instance of pyarrow.Table, since Categorical instances with these CategoryMode values don't have an encoding in pyarrow that would directly preserve their structure (for example, the direct mapping between the case labels and values for a CategoryMode.IntEnum or CategoryMode.Dictionary-mode Categorical).

Parameters:
  • arr (pyarrow.Array or pyarrow.ChunkedArray) – Must be a dictionary-encoded pyarrow array or a Struct-type array (e.g. pyarrow.StructArray).

  • zero_copy_only (bool, optional, defaults to True) –

  • writable (bool, optional, defaults to False) –

Return type:

Categorical

classmethod _from_maybe_non_unique_labels(values, categories, base_index=1)[source]

Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.

classmethod _from_meta_data(arrdict, arrflags, meta)[source]
_getsingleitem(fld)[source]

If the getitem indexing operation returned a scalar, translate it according to how the uniques are being held.

Return type:

Scalar or tuple based on unique type.

_ipython_key_completions_()[source]

Supports tab completion with bracket indexing (__getitem__). The IPython completer needs a Python list or dict keys/values. If there is nothing to return (e.g. for a multikey categorical), an empty list is returned. An empty list is also returned if the categorical has more than 10_000 unique values. If an IPython environment is detected, the ‘greedy’ property is set to True in riptable’s __init__.

classmethod _load_from_sds_meta_data(name, arr, cols, meta)[source]

Builds a categorical object from metadata and arrays.

Translates metadata and array/column layout from older versions to be compatible with the current loader. Raises an error if the metadata version is higher than the class’s meta version (the user will need to update riptable).

Parameters:
  • name (item's name in the calling container, or the classname Categorical by default) –

  • arr (the underlying index array for the categorical) –

  • cols (additional arrays to rebuild unique categories) –

  • meta (meta data generated by the _build_sds_meta_data() routine) –

Returns:

Reconstructed categorical object.

Return type:

Categorical

Examples

>>> m = y._build_sds_meta_data('y')
>>> rt.Categorical._load_from_sds_meta_data('y', y._fa, m[1], m[0])
_meta_dict(name=None)[source]
_nan_idx()[source]

Internal - for isnan, isnotnan

_nanfunc(func, fillval)[source]
_prepend_invalid(arr)[source]

For base index 1 categoricals, add the invalid category to the beginning of the array of unique categories.

Parameters:

arr (FastArray) – The array holding the unique category values for this Categorical. This array may be a FastArray or a subclass of FastArray.

Returns:

An array of the same type as arr whose length is len(arr) + 1, where the first (0th) element of the array is the invalid value for that array type.

Return type:

FastArray

static _scalar_compiled_numba_apply(iGroup, iFirstGroup, nCountGroup, userfunc, args)[source]
_tf_spacer(tf_string)[source]
static _transformed_scalar_compiled_numba_apply(iGroup, iFirstGroup, nCountGroup, userfunc, args)[source]
classmethod align(cats)[source]

Cats must be a list of categoricals. The unique categories will be merged into a new unique list. The indices will be fixed to point to the new category array.

Return type:

A list of (possibly) new categoricals which share the same categories (and thus bin numbering).

Examples

>>> c1 = rt.Categorical(['a','b','c'])
>>> c2 = rt.Categorical(['d','e','f'])
>>> c3 = rt.Categorical(['c','f','z'])
>>> rt.Categorical.align([c1,c2,c3])
[Categorical([a, b, c]) Length: 3
  FastArray([1, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7
Categorical([d, e, f]) Length: 3
  FastArray([4, 5, 6], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7
Categorical([c, f, z]) Length: 3
  FastArray([3, 6, 7], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c', b'd', b'e', b'f', b'z'], dtype='|S1') Unique count: 7]
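The merge-and-remap step described for align can be sketched in plain Python. This is a hypothetical model rather than riptable's implementation: each Categorical is reduced to a (codes, categories) pair with base-1 codes, and sorting the merged categories is an assumption that happens to match the example output.

```python
def align(cats):
    """cats: list of (codes, categories) pairs with base-1 codes.

    Hypothetical pure-Python model; sorting the merged categories is an
    assumption made for this sketch.
    """
    merged = sorted({cat for _, categories in cats for cat in categories})
    position = {cat: i + 1 for i, cat in enumerate(merged)}
    aligned = []
    for codes, categories in cats:
        # remap[old_code] -> new_code; code 0 (Filtered) stays 0.
        remap = [0] + [position[cat] for cat in categories]
        aligned.append(([remap[code] for code in codes], merged))
    return aligned

c1 = ([1, 2, 3], ["a", "b", "c"])
c2 = ([1, 2, 3], ["d", "e", "f"])
c3 = ([1, 2, 3], ["c", "f", "z"])
aligned = align([c1, c2, c3])
```

After alignment every pair shares one category list, so equal codes mean equal categories across all of them.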
apply(userfunc=None, *args, dataset=None, **kwargs)[source]

See Grouping.apply for examples. Categorical needs to remove unused bins from its uniques before an apply.

apply_nonreduce(userfunc=None, *args, dataset=None, **kwargs)[source]

See GroupByOps.apply_nonreduce for examples. Categorical needs to remove unused bins from its uniques before an apply.

argsort()[source]
as_singlekey(ordered=False, sep='_')[source]

Normalizes categoricals by returning a base 1 single key categorical.

Enum- or dict-based categoricals and multikey categoricals are converted to single-key categoricals. A categorical that is already single key with base index 0 is returned with base index 1. A categorical that is already single key with base index 1 is returned as is.

Parameters:
  • ordered (bool, default False) – Whether to sort the result.

  • sep (char, default '_') – The multikey separator. Only valid for multikey categoricals.

Examples

>>> c=rt.Cat([5, -3, 7], {-3:'one', 2:'two', 5: 'three', 7:'four'})
>>> d=c.as_singlekey()
>>> c._fa
FastArray([ 5, -3,  7])
>>> d._fa
FastArray([3, 2, 1], dtype=int8)
Return type:

A single key base 1 categorical.

auto_add_off()[source]

Sets the _auto_add_categories flag to False. Assigning a value that is not an existing category will raise an error.

Examples

>>> c = rt.Categorical(['a','a','b','c','a'], auto_add=True)
>>> c._categories
FastArray([b'a', b'b', b'c'], dtype='|S1')
>>> c.auto_add_off()
>>> c[0] = 'z'
ValueError: Cannot automatically add categories [b'z'] while auto_add_categories is set to False.
auto_add_on()[source]

If the categorical is unlocked, this sets the _auto_add_categories flag to True, so assigning a value that isn't an existing category adds it as a new category. While the flag is False, such an assignment raises an error. If the categorical is locked, auto_add_on() warns the user and the flag does not change.

Examples

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c._categories
FastArray([b'a', b'b', b'c'], dtype='|S1')
>>> c.auto_add_on()
>>> c[0] = 'z'
>>> print(c)
z, a, b, c, a
>>> c._categories
FastArray([b'a', b'b', b'c', b'z'], dtype='|S1')
categories(showfilter=True)[source]

If the categories are stored in a single array or single-key dictionary, an array will be returned. If the categories are stored in a multikey dictionary, a dictionary will be returned. If the categories are a mapping, a dictionary of the mapping (int -> string) will be returned.

Note: you can also request categories in a certain format when possible using properties: category_array, category_dict, category_mapping.

Parameters:

showfilter (bool, defaults to True) – If True (default), the invalid category will be prepended to the returned array or multikey columns. Does not apply when mapping is returned.

Return type:

np.ndarray or dict

Examples

>>> c = rt.Categorical(['a','a','b','c','d'])
>>> c.categories()
FastArray([b'Inv', b'a', b'b', b'c', b'd'], dtype='|S3')
>>> c = rt.Categorical([rt.arange(3), rt.FA(['a','b','c'])])
>>> c.categories()
{'key_0': FastArray([-2147483648,           0,           1,           2]),
 'key_1': FastArray([b'Inv', b'a', b'b', b'c'], dtype='|S3')}
>>> c = rt.Categorical(rt.arange(3), {'a':0, 'b':1, 'c':2})
>>> c.categories()
{0: 'a', 1: 'b', 2: 'c'}
classmethod categories_equal(cats)[source]

Check if every Categorical or array has the same categories (same unique values in the same order).

Parameters:

cats (list of Categorical or np.ndarray or tuple of np.ndarray) – cats must be a list of Categorical objects or arrays that can be converted to Categorical objects.

Returns:

  • match (bool) – True if every Categorical has the same categories (same unique values in same order), otherwise False.

  • fixed_cats (list of Categorical) – List of Categorical objects which may have been fixed up.

category_add(value)[source]

New category will always be added to the end of the category array.

category_make_unique()[source]

Remove duplicated categories by replacing categories with the unique set and remapping codes. Gets out early if categories are already unique.
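A minimal sketch of the deduplicate-and-remap idea, assuming base-1 codes and assuming that the first occurrence of each category keeps its relative order (neither detail is confirmed by the description above). make_unique is a hypothetical stand-alone function, not the riptable method.

```python
def make_unique(codes, categories):
    """Replace duplicated categories with the unique set and remap the codes."""
    if len(set(categories)) == len(categories):
        return codes, categories          # already unique: get out early
    first_seen = {}                       # category -> new base-1 code
    remap = [0]                           # old code -> new code; 0 stays 0
    for cat in categories:
        if cat not in first_seen:
            first_seen[cat] = len(first_seen) + 1
        remap.append(first_seen[cat])
    unique = list(first_seen)             # dicts preserve insertion order
    return [remap[code] for code in codes], unique

new_codes, new_categories = make_unique([1, 2, 3, 4, 3], ["a", "b", "a", "c"])
```

Both codes that pointed at an "a" entry now point at the single remaining "a".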

category_remove(value)[source]

Performance may suffer as indices need to be fixed up. All previous matches to the removed category will be flipped to invalid.

category_replace(value, new_value)[source]
copy(categories=None, ordered=None, sort_gb=None, lex=None, base_index=None, filter=None, dtype=None, unicode=None, invalid=None, auto_add=False, from_matlab=False, _from_categorical=None, deep=True, order='K')[source]

Return a copy of the input FastArray.

Parameters:

order ({'K', 'C', 'F', 'A'}, default 'K') – Controls the memory layout of the copy: ‘K’ means match the layout of the input array as closely as possible; ‘C’ means row-based (C-style) order; ‘F’ means column-based (Fortran-style) order; ‘A’ means ‘F’ if the input array is formatted as ‘F’, ‘C’ if not.

Returns:

A copy of the input FastArray.

Return type:

FastArray

See also

Categorical.copy

Return a copy of the input Categorical.

Dataset.copy

Return a copy of the input Dataset.

Struct.copy

Return a copy of the input Struct.

Examples

Copy a FastArray:

>>> a = rt.FA([1, 2, 3, 4, 5])
>>> a
FastArray([1, 2, 3, 4, 5])
>>> a2 = a.copy()
>>> a2
FastArray([1, 2, 3, 4, 5])
>>> a2 is a
False  # The copy is a separate object.
copy_invalid()[source]

Return a copy of a FastArray filled with the invalid value for the array’s data type.

Returns:

A copy of the input array, filled with the invalid value for the array’s dtype.

Return type:

FastArray

See also

FastArray.inv

Return the invalid value for the input array’s dtype.

FastArray.fill_invalid

Replace the values of a FastArray with the invalid value for the array’s dtype.

Examples

Copy an integer array and replace with invalids:

>>> a = rt.FA([1, 2, 3, 4, 5])
>>> a
FastArray([1, 2, 3, 4, 5])
>>> a2 = a.copy_invalid()
>>> a2
FastArray([-2147483648, -2147483648, -2147483648, -2147483648,
           -2147483648])
>>> a
FastArray([1, 2, 3, 4, 5])  # a is unchanged.

Copy a floating-point array and replace with invalids:

>>> a3 = rt.FA([0., 1., 2., 3., 4.])
>>> a3
FastArray([0., 1., 2., 3., 4.])
>>> a3.copy_invalid()
FastArray([nan, nan, nan, nan, nan])

Copy a string array and replace with invalids:

>>> a4 = rt.FA(['AMZN', 'IBM', 'MSFT', 'AAPL'])
>>> a4
FastArray([b'AMZN', b'IBM', b'MSFT', b'AAPL'], dtype='|S4')
>>> a4.copy_invalid()
FastArray([b'', b'', b'', b''], dtype='|S4')  # Invalid string value is an empty string.
count(filter=None, transform=False)[source]

Count the number of times each value appears in a Categorical.

Unlike other Categorical operations, this does not take a parameter for data.

Parameters:
  • filter (array of bool, optional) – Categorical values that correspond to False filter values are excluded from the count. The filter array must be the same length as the Categorical.

  • transform (bool, default False) – Set to True to return a Dataset that’s the length of the Categorical, with counts aligned to the ungrouped Categorical values. Only the counts are included.

Returns:

A Dataset containing each unique category and its count. If transform is True, the Dataset is the same length as the original Categorical and contains only the counts.

Return type:

rt_dataset.Dataset

See also

rt_grouping.Grouping.count()

Called by this method.

rt_categorical.Categorical.unique_count()

Return the number of unique values in a Categorical.

rt_fastarray.FastArray.count()

Return the unique values of a FastArray and their counts.

Examples

Create a Categorical and count its values:

>>> c = rt.Categorical(["a", "a", "b", "c", "a", "c"])
>>> c
Categorical([a, a, b, c, a, c]) Length: 6
  FastArray([1, 1, 2, 3, 1, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.count()
*key_0   Count
------   -----
a            3
b            1
c            2

[3 rows x 2 columns] total bytes: 15.0 B

Filter based on Categorical values:

>>> f = (c == "a")
>>> c.count(filter=f)
*key_0   Count
------   -----
a            3
b            0
c            0

[3 rows x 2 columns] total bytes: 15.0 B

Filter based on a separate array of values:

>>> vals = rt.arange(6)
>>> f = (vals > 2)
>>> c.count(filter=f)
*key_0   Count
------   -----
a            1
b            0
c            2

[3 rows x 2 columns] total bytes: 15.0 B

With transform=True, a Dataset is returned with counts aligned to the ungrouped Categorical values:

>>> c.count(transform=True)
#   Count
-   -----
0       3
1       3
2       1
3       2
4       3
5       2

[6 rows x 1 columns] total bytes: 24.0 B
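The counting, filter, and transform behavior shown in these examples can be reproduced with a small plain-Python model. It is a sketch only: count_codes is a hypothetical helper that works on base-1 codes, it returns lists instead of a Dataset, and it does not try to model what riptable does when filter and transform are combined.

```python
def count_codes(codes, n_categories, filter=None, transform=False):
    """Count occurrences per category from base-1 codes (0 = Filtered)."""
    counts = [0] * (n_categories + 1)     # slot 0 collects Filtered rows
    for i, code in enumerate(codes):
        if filter is None or filter[i]:   # rows with a False filter are skipped
            counts[code] += 1
    if transform:
        # One count per original row, aligned to the ungrouped values.
        return [counts[code] for code in codes]
    return counts[1:]                     # one count per category

codes = [1, 1, 2, 3, 1, 3]                # codes for ["a", "a", "b", "c", "a", "c"]
plain = count_codes(codes, 3)
only_a = count_codes(codes, 3, filter=[code == 1 for code in codes])
late_rows = count_codes(codes, 3, filter=[i > 2 for i in range(6)])
per_row = count_codes(codes, 3, transform=True)
```

Categories whose rows are all filtered out still appear in the result with a count of 0, as in the filtered examples above.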
static display_convert_func(item, itemformat)[source]

Used in conjunction with display_query_properties for final display of a categorical in a dataset. Removes quotation marks from multikey categorical tuples so display is easier to read.

display_query_properties()[source]

Takes over display query properties for fastarray. By default, all categoricals will use left alignment.

expand_any(categories)[source]
Parameters:

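categories (list or np.ndarray) – Replacement values. Must be the same size as the categories array.

Return type:

A re-expanded array in which each element is the entry of the passed-in categories that corresponds to the element's category.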

Examples

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c.expand_any(['d','e','f'])
FastArray(['d', 'd', 'e', 'f', 'd'], dtype='<U8')
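Conceptually, the re-expansion is a lookup of each code in the replacement array. The following plain-Python sketch makes that explicit; expand_codes is a hypothetical helper, it assumes base-1 codes by default, and it assumes there are no Filtered (0) codes.

```python
def expand_codes(codes, new_categories, base_index=1):
    """Map every code to the replacement value at the same category position.

    Sketch only: assumes no Filtered (0) codes are present.
    """
    return [new_categories[code - base_index] for code in codes]

codes = [1, 1, 2, 3, 1]                   # codes for ['a', 'a', 'b', 'c', 'a']
expanded = expand_codes(codes, ["d", "e", "f"])
```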
fill_backward(*args, limit=0, fill_val=None, inplace=False)[source]

Replace NaN and invalid array values by propagating the next encountered valid group value backward.

Optionally, you can modify the original array if it’s not locked.

Parameters:
  • *args (array or list of arrays) – The array or arrays that contain NaN or invalid values you want to replace.

  • limit (int, default 0 (disabled)) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.

  • fill_val (scalar, default None) – The value to use where there is no valid group value to propagate backward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate backward.

  • inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.

Returns:

The Categorical will be the same size and have the same dtypes as the original input.

Return type:

Categorical

See also

Categorical.fill_forward

Replace NaN and invalid array values with the last valid group value.

GroupBy.fill_backward

Replace NaN and invalid array values with the next valid group value.

riptable.fill_backward

Replace NaN and invalid values with the next valid value.

Dataset.fillna

Replace NaN and invalid values with a specified value or nearby data.

FastArray.fillna

Replace NaN and invalid values with a specified value or nearby data.

Examples

>>> cat = rt.Categorical(['A', 'B', 'A', 'B', 'A', 'B'])
>>> x = rt.FA([rt.nan, rt.nan, 2, 3, 4, 5])
>>> cat.fill_backward(x)
*gb_key_0   col_0
---------   -----
A            2.00
B            3.00
A            2.00
B            3.00
A            4.00
B            5.00

Use a fill_val to replace values where there’s no valid group value to propagate backward:

>>> x = rt.FastArray([0, 1, 2, 3, rt.nan, rt.nan])
>>> cat.fill_backward(x, fill_val = 0)[0]
FastArray([0., 1., 2., 3., 0., 0.])

Use limit=1 to replace only one value in each consecutive series of NaN or invalid values in a group, namely the one adjacent to the next valid value:

>>> x = rt.FastArray([rt.nan, rt.nan, rt.nan, rt.nan, 4, 5])
>>> cat.fill_backward(x, limit = 1)[0]
FastArray([nan, nan,  4.,  5.,  4.,  5.])
fill_forward(*args, limit=0, fill_val=None, inplace=False)[source]

Replace NaN and invalid array values by propagating the last encountered valid group value forward.

Optionally, you can modify the original array if it’s not locked.

Parameters:
  • *args (array or list of arrays) – The array or arrays that contain NaN or invalid values you want to replace.

  • limit (int, default 0 (disabled)) – The maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled. If no limit is specified, all consecutive NaN and invalid values are replaced.

  • fill_val (scalar, default None) – The value to use where there is no valid group value to propagate forward. If fill_val is not specified, NaN and invalid values aren’t replaced where there is no valid group value to propagate forward.

  • inplace (bool, default False) – If False, return a copy of the array. If True, modify original data. This will modify any other views on this object. This fails if the array is locked.

Returns:

The Categorical will be the same size and have the same dtypes as the original input.

Return type:

Categorical

See also

Categorical.fill_backward

Replace NaN and invalid array values with the next valid group value.

GroupBy.fill_forward

Replace NaN and invalid array values with the last valid group value.

riptable.fill_forward

Replace NaN and invalid values with the last valid value.

Dataset.fillna

Replace NaN and invalid values with a specified value or nearby data.

FastArray.fillna

Replace NaN and invalid values with a specified value or nearby data.

Examples

>>> cat = rt.Categorical(['A', 'B', 'A', 'B', 'A', 'B'])
>>> x = rt.FastArray([0, 1, 2, 3, rt.nan, rt.nan])
>>> cat.fill_forward(x)
*gb_key_0   col_0
---------   -----
A            0.00
B            1.00
A            2.00
B            3.00
A            2.00
B            3.00

Use a fill_val to replace values where there’s no valid group value to propagate forward:

>>> x = rt.FastArray([rt.nan, rt.nan, 2, 3, 4, 5])
>>> cat.fill_forward(x, fill_val = 0)[0]
FastArray([0., 0., 2., 3., 4., 5.])

Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values in a group:

>>> x = rt.FastArray([0, 1, rt.nan, rt.nan, rt.nan, rt.nan])
>>> cat.fill_forward(x, limit = 1)[0]
FastArray([ 0.,  1.,  0.,  1., nan, nan])
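The group-wise fill described for fill_forward and fill_backward can be modeled as a single scan that remembers the last valid value per group. This is a plain-Python sketch, not riptable's implementation: group_fill and its backward flag are hypothetical, only float NaN is treated as missing, and values beyond the limit are left untouched even when fill_val is given (an assumption).

```python
import math

def group_fill(keys, values, limit=0, fill_val=None, backward=False):
    """Fill NaNs with the nearest earlier (or later) valid value of the same group.

    Hypothetical helper for illustration only.
    """
    out = list(values)
    n = len(out)
    order = range(n - 1, -1, -1) if backward else range(n)
    last_valid = {}  # group key -> most recent valid value in scan order
    run = {}         # group key -> length of the current run of NaNs
    for i in order:
        key, value = keys[i], out[i]
        if not math.isnan(value):
            last_valid[key] = value
            run[key] = 0
        elif key in last_valid:
            run[key] += 1
            if limit == 0 or run[key] <= limit:  # limit=0 means no limit
                out[i] = last_valid[key]
        elif fill_val is not None:
            out[i] = fill_val  # nothing to propagate for this group yet
    return out

nan = float("nan")
keys = ["A", "B", "A", "B", "A", "B"]
forward = group_fill(keys, [0, 1, 2, 3, nan, nan])
padded = group_fill(keys, [nan, nan, 2, 3, 4, 5], fill_val=0)
limited = group_fill(keys, [0, 1, nan, nan, nan, nan], limit=1)
backward = group_fill(keys, [nan, nan, nan, nan, 4, 5], limit=1, backward=True)
```

Running the scan from the end of the array turns the forward fill into a backward fill, which is why one helper covers both directions.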
fill_invalid(shape=None, dtype=None, order=None, inplace=True)[source]

Returns a Categorical full of invalids, with reference to same categories. Must be base index 1.

filtered_set_name(name)[source]

Set the name or value that will be displayed for filtered categories. The default is FILTERED_LONG_NAME.

from_bin(bin)[source]

Returns the category corresponding to a single integer. Raises error if index is out of range (accounts for base index) - or does not exist in mapping.

Notes

String values will appear as the scalar type they are stored in; however, FastArray, Categorical, and other riptable routines will convert/compensate for unicode/bytestring mismatches.

Examples

Base-1 Indexing:

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c.category_array
FastArray([b'a', b'b', b'c'], dtype='|S1')
>>> c.from_bin(2)
b'b'
>>> c.from_bin(4)
IndexError

Base-0 Indexing:

>>> c = rt.Categorical(['a','a','b','c','a'], base_index=0)
>>> c.from_bin(2)
b'c'
from_category(category)[source]

Returns the bin associated with a category. If the category doesn’t exist, an error will be raised.

Note: the bin returned is the value as it appears in the underlying integer FastArray. It may not be a direct index into the stored unique categories.

Unicode/bytes conversion will be handled internally.

Examples

Single Key (base-1):

>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c.from_category('a')
1
>>> c = rt.Categorical(['a','a','b','c','a'])
>>> c.from_category(b'c')
3

Single Key (base-0):

>>> c = rt.Categorical(['a','a','b','c','a'], base_index=0)
>>> c.from_category('a')
0

Multikey:

>>> c = rt.Categorical([rt.FA(['a','b','c']), rt.arange(3)])
>>> c.from_category(('a', 0))
1

Mapping:

>>> c = rt.Categorical([1,2,3], {'a':1, 'b':2, 'c':3})
>>> c.from_category('c')
3

Numeric:

>>> c = rt.Categorical(rt.FA([3.33, 5.55, 6.66]))
>>> c.from_category(3.33)
1
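For the single-key case, the relationship between bins, categories, and the base index comes down to an offset. The sketch below models it in plain Python; bin_to_category and category_to_bin are hypothetical helpers, and raising IndexError for bin 0 with base index 1 is an assumption of the sketch. Multikey and mapping modes are not modeled.

```python
def bin_to_category(categories, bin, base_index=1):
    """Return the category stored for `bin`, accounting for the base index."""
    position = bin - base_index
    if not 0 <= position < len(categories):
        raise IndexError(f"bin {bin} is out of range")
    return categories[position]

def category_to_bin(categories, category, base_index=1):
    """Return the bin as it would appear in the underlying integer array."""
    try:
        return categories.index(category) + base_index
    except ValueError:
        raise ValueError(f"{category!r} is not a category") from None

categories = [b"a", b"b", b"c"]
base1 = bin_to_category(categories, 2)
base0 = bin_to_category(categories, 2, base_index=0)
bin_a = category_to_bin(categories, b"a")
bin_a_base0 = category_to_bin(categories, b"a", base_index=0)
```

The two helpers are inverses of each other for any bin that is in range.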
static full(size, value)[source]

Create a Categorical of a given length, filled with a single value.

Parameters:
  • size (int) – The size/length of the Categorical to create.

  • value – The value to be repeated.

Return type:

Categorical

Examples

Create a 1D Categorical array of length 100_000, filled with the string “example”.

>>> rt.Categorical.full(100_000, 'example')
Categorical([example, example, example, example, example, ..., example, example, example, example, example]) Length: 100000
  FastArray([1, 1, 1, 1, 1, ..., 1, 1, 1, 1, 1], dtype=int8) Base Index: 1
  FastArray([b'example'], dtype='|S7') Unique count: 1
groupby_data_clear()[source]

Remove any stored dataset for future groupby operations.

groupby_data_set(ds)[source]

Store data to apply future groupby operations to. This will make the categorical behave like a groupby object that was created from a dataset. If data is specified during an operation, it will be used instead of the stored dataset.

Parameters:

ds (Dataset) –

Examples

>>> c = rt.Categorical(['a','b','c','c','a','a'])
>>> a = np.arange(6)
>>> ds = rt.Dataset({'col':a})
>>> c.groupby_data_set(ds)
>>> c.sum()
*gb_key   col
-------   ---
a           9
b           1
c           5
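The stored-data behavior can be pictured with a toy class: data set in advance is used by default, and data passed to an operation takes precedence. GroupKey below is a hypothetical stand-in written in plain Python, not riptable's Categorical; it uses a dict of lists in place of a Dataset, and raising ValueError when no data is available is an assumption of the sketch.

```python
class GroupKey:
    """Toy stand-in for a base-1 Categorical used as a groupby key."""

    def __init__(self, values):
        self.categories = sorted(set(values))
        position = {cat: i + 1 for i, cat in enumerate(self.categories)}
        self.codes = [position[v] for v in values]
        self._data = None

    def groupby_data_set(self, data):
        self._data = data                 # remember data for later operations

    def groupby_data_clear(self):
        self._data = None

    def sum(self, data=None):
        # Data passed to the operation wins over the stored data.
        data = self._data if data is None else data
        if data is None:
            raise ValueError("no data given and no stored dataset")
        result = {}
        for name, column in data.items():
            totals = [0] * (len(self.categories) + 1)
            for code, value in zip(self.codes, column):
                totals[code] += value
            result[name] = dict(zip(self.categories, totals[1:]))
        return result

key = GroupKey(["a", "b", "c", "c", "a", "a"])
key.groupby_data_set({"col": [0, 1, 2, 3, 4, 5]})
stored = key.sum()
explicit = key.sum({"other": [1, 1, 1, 1, 1, 1]})
```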
groupby_reset()[source]

Resets all lazily evaluated groupby information. The categorical will go back to the state it was in just after construction. This is called any time the categories are modified.

classmethod hstack(cats)[source]

Cats must be a list of categoricals. The unique categories will be merged into a new unique list. The indices will be fixed to point to the new category array. The indices are then hstacked and a new categorical is returned.

Examples

>>> c1 = rt.Categorical(['a','b','c'])
>>> c2 = rt.Categorical(['d','e','f'])
>>> combined = rt.Categorical.hstack([c1,c2])
>>> combined
Categorical([a, b, c, d, e, f]) Length: 6
  FastArray([1, 2, 3, 4, 5, 6]) Base Index: 1
  FastArray([b'a', b'b', b'c', b'd', b'e', b'f'], dtype='|S1') Unique count: 6
info()[source]

The three arrays in info:

  • Categories mapped to their indices, often making the categorical appear to be a string array, along with the length of the array.

  • The underlying array of integer indices, its dtype, and the base index (normally 1, to reserve 0 as an invalid bin for groupby, which is much better for performance).

  • The categories, as a list or dictionary.

The CategoryMode is also displayed:

Mode:

  • Default - no example

  • StringArray - categories are held in a single string array

  • IntEnum - categories are held in a dictionary generated from an IntEnum

  • Dictionary - categories are held in a dictionary generated from a code-mapping dictionary

  • NumericArray - categories are held in a single numeric array

  • MultiKey - categories are held in a dictionary (when constructed with multikey, or numeric categories the groupby hash does the binning)

Locked:

If True, categories may not be changed.

invalid_set(inv)[source]

Set a Categorical category to be invalid.

An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.

If there’s an existing invalid category in the Categorical, using Categorical.invalid_set to set a different category causes the existing invalid category to become valid.

Parameters:

inv (str or bytes) – The category to be made invalid.

Return type:

None

See also

Categorical.isnan

Find the invalid elements of a Categorical.

Categorical.isnotnan

Find the valid elements of a Categorical.

Categorical.invalid_category

The Categorical object’s invalid category.

Examples

>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"])
>>> c
Categorical([b, a, c, b, c]) Length: 5
  FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.invalid_set("b")
>>> c.invalid_category
'b'
>>> c.isnan()  # Returns True for invalid category.
FastArray([ True, False, False,  True, False])

Set a new invalid category:

>>> c.invalid_set("a")
>>> c.invalid_category
'a'
>>> c.isnan()
FastArray([False,  True, False, False, False])
isfiltered()[source]

True where bin == 0. Only applies to categoricals with base index 1; otherwise returns all False. A Filtered value is different from an invalid category.

isin(values)[source]
Parameters:

values (a list-like or single value to be searched for) –

Returns:

Boolean array with the same size as self. True indicates that the array element occurred in the provided values.

Return type:

FastArray

Notes

Behavior differs from pandas in the following ways:

  • Riptable favors bytestrings, and will make conversions from unicode/bytes to match for operations as necessary.

  • We also accept single scalars for values.

  • A pandas Series will return another Series; riptable has no Series, and will return a FastArray.

Examples

>>> c = rt.Categorical(['a','b','c','d','e'], unicode=False)
>>> c.isin(['a','b'])
FastArray([ True,  True, False, False, False])

See also

pandas.Categorical.isin

isna(*args, **kwargs)[source]

See Categorical.isnan.

isnan(*args, **kwargs)[source]

Find the invalid elements of a Categorical.

An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.

Returns:

A boolean array the length of the values array where True indicates an invalid Categorical category.

Return type:

FastArray

See also

Categorical.isnotnan

Find the valid elements of a Categorical.

Categorical.invalid_category

The Categorical object’s invalid category.

Categorical.invalid_set

Set a Categorical category to be invalid.

Examples

>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b")
>>> c
Categorical([b, a, c, b, c]) Length: 5
  FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.isnan()
FastArray([ True, False, False,  True, False])

Invalid categories are different from Filtered categories:

>>> f = rt.FA([True, False, True, True, True])
>>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b", filter=f)
>>> c2
Categorical([b, Filtered, c, b, c]) Length: 5
  FastArray([1, 0, 2, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'b', b'c'], dtype='|S1') Unique count: 2
>>> c2.isnan()  # Only the invalid category returns True for Cat.isnan.
FastArray([ True, False, False,  True, False])
>>> c2.isfiltered()  # Only the Filtered value returns True for Cat.isfiltered.
FastArray([False,  True, False, False, False])

Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is False for Cat.isnan():

>>> a = rt.FA([1, 2, 3, 4])
>>> a[3] = a.inv  # Set the last value to an integer NaN.
>>> a
FastArray([          1,           2,           3, -2147483648])
>>> c3 = rt.Categorical(values=a, invalid=2)  # Make 2 an invalid category.
>>> c3
Categorical([1, 2, 3, -2147483648]) Length: 4
  FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1
  FastArray([-2147483648,           1,           2,           3]) Unique count: 4
>>> c3.invalid_category
2
>>> c3.isnan()  # Only the invalid category returns True for Cat.isnan.
FastArray([False,  True, False, False])
>>> c3.expand_array.isnan()  # Only the integer NaN returns True for FA.isnan.
FastArray([False, False, False,  True])
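
The difference between an invalid category and a Filtered value can be modelled in plain Python. The sketch below is illustrative only and is not how Riptable implements these methods; `cat_isnan` and `cat_isfiltered` are hypothetical helper names, and the code assumes base index 1 with code 0 reserved for Filtered.

```python
def cat_isnan(codes, categories, invalid):
    """True where an element's category equals the designated invalid category."""
    result = []
    for code in codes:
        if code == 0:
            # Code 0 means Filtered (assumed base index 1); Filtered is not invalid.
            result.append(False)
        else:
            result.append(categories[code - 1] == invalid)
    return result


def cat_isfiltered(codes):
    """True where an element carries the reserved Filtered code 0."""
    return [code == 0 for code in codes]


# Same codes and categories as c2 in the example above.
codes = [1, 0, 2, 1, 2]
categories = ["b", "c"]
print(cat_isnan(codes, categories, invalid="b"))  # [True, False, False, True, False]
print(cat_isfiltered(codes))                      # [False, True, False, False, False]
```

An integer NaN stored as an ordinary category would compare unequal to the invalid category here, which is why it is reported as valid.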
isnotnan(*args, **kwargs)[source]

Find the valid elements of a Categorical.

An invalid category is specified when the Categorical is created or set afterward using Categorical.invalid_set. An invalid category is different from a Filtered category or a NaN value.

Returns:

A boolean array the length of the values array where True indicates a valid Categorical category.

Return type:

FastArray

See also

Categorical.isnan

Find the invalid elements of a Categorical.

Categorical.invalid_category

The Categorical object’s invalid category.

Categorical.invalid_set

Set a Categorical category to be invalid.

Examples

>>> c = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b")
>>> c
Categorical([b, a, c, b, c]) Length: 5
  FastArray([2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.isnotnan()
FastArray([False,  True,  True, False,  True])

Invalid categories are different from Filtered categories:

>>> f = rt.FA([True, False, True, True, True])
>>> c2 = rt.Categorical(values=["b", "a", "c", "b", "c"], invalid="b", filter=f)
>>> c2
Categorical([b, Filtered, c, b, c]) Length: 5
  FastArray([1, 0, 2, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'b', b'c'], dtype='|S1') Unique count: 2
>>> c2.isnotnan()  # Only the invalid category returns False for Cat.isnotnan.
FastArray([False,  True,  True, False,  True])
>>> ~c2.isfiltered()  # Only the Filtered value returns False for the negation of Cat.isfiltered.
FastArray([ True, False,  True,  True,  True])

Invalid categories in a Categorical are different from regular integer NaN values. An integer NaN is a valid category and is True for Cat.isnotnan():

>>> a = rt.FA([1, 2, 3, 4])
>>> a[3] = a.inv  # Set the last value to an integer NaN.
>>> a
FastArray([          1,           2,           3, -2147483648])
>>> c3 = rt.Categorical(values=a, invalid=2)  # Make 2 an invalid category.
>>> c3
Categorical([1, 2, 3, -2147483648]) Length: 4
  FastArray([2, 3, 4, 1], dtype=int8) Base Index: 1
  FastArray([-2147483648,           1,           2,           3]) Unique count: 4
>>> c3.invalid_category()
2
>>> c3.isnotnan()  # Only the invalid category returns False for Cat.isnotnan.
FastArray([ True, False,  True,  True])
>>> c3.expand_array.isnotnan()  # Only the integer NaN returns False for FA.isnotnan.
FastArray([ True,  True,  True, False])
lock()[source]

Locks the categories so none can be added, removed, or changed.

map(mapper, invalid=None)[source]

Maps existing categories to new categories and returns a re-expanded array.

Parameters:
  • mapper (dictionary or numpy.array or FastArray) –

    • dictionary maps existing categories -> new categories

    • array must be the same size as the existing category array

  • invalid – Optionally specify an invalid value to insert for existing categories that were not found in the new mapping. If no invalid is set, the default invalid for the result’s dtype will be used.

Returns:

Re-expanded array.

Return type:

FastArray

Notes

Possible additions: an option to return a Categorical instead of re-expanding, and a dtype for the returned array.

Examples

New strings (all exist, no invalids in original):

>>> c = rt.Categorical(['b','b','c','a','d'], ordered=False)
>>> mapping = {'a': 'AA', 'b': 'BB', 'c': 'CC', 'd': 'DD'}
>>> c.map(mapping)
FastArray([b'BB', b'BB', b'CC', b'AA', b'DD'], dtype='|S3')

New strings (not all exist, no invalids in original):

>>> mapping = {'a': 'AA', 'b': 'BB', 'c': 'CC'}
>>> c.map(mapping, invalid='INVALID')
FastArray([b'BB', b'BB', b'CC', b'AA', b'INVALID'], dtype='|S7')

String to float:

>>> mapping = {'a': 1., 'b': 2., 'c': 3.}
>>> c.map(mapping, invalid=666)
FastArray([  2.,   2.,   3.,   1., 666.])

If no invalid is specified, the default invalid will be used:

>>> c.map(mapping)
FastArray([ 2.,  2.,  3.,  1., nan])

Mapping as array (must be the same size):

>>> mapping = rt.FastArray(['w','x','y','z'])
>>> c.map(mapping)
FastArray([b'w', b'w', b'x', b'y', b'z'], dtype='|S3')
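
The mapping behavior shown above can be sketched without Riptable. This is a simplified model, not the library's implementation: `map_categories` is a hypothetical name, the codes use an assumed base index of 1, and the category order (b, c, a, d) is inferred from the array-mapping example.

```python
def map_categories(codes, categories, mapper, invalid=None):
    """Re-expand codes through a dict or a same-length sequence of new categories."""
    if isinstance(mapper, dict):
        # Categories missing from the dict fall back to the invalid value.
        new_categories = [mapper.get(cat, invalid) for cat in categories]
    else:
        if len(mapper) != len(categories):
            raise ValueError("array mapper must match the number of categories")
        new_categories = list(mapper)
    # Code k refers to categories[k - 1]; code 0 (Filtered) also maps to invalid.
    return [new_categories[code - 1] if code > 0 else invalid for code in codes]


categories = ["b", "c", "a", "d"]
codes = [1, 1, 2, 3, 4]  # stands for ['b', 'b', 'c', 'a', 'd']

print(map_categories(codes, categories, {"a": "AA", "b": "BB", "c": "CC"}, invalid="INVALID"))
# ['BB', 'BB', 'CC', 'AA', 'INVALID']
print(map_categories(codes, categories, ["w", "x", "y", "z"]))
# ['w', 'w', 'x', 'y', 'z']
```

The new values are looked up once per category rather than once per element, which is the reason a mapping over categories is cheap for long arrays.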
mapping_add(code, value)[source]

Add a new code -> value mapping to categories.

mapping_new(mapping)[source]

Replace entire mapping dictionary. No codes in the Categorical’s integer FastArray will be changed. If they are not in the new mapping, they will appear as Invalid.

mapping_remove(code)[source]

Remove the category associated with an integer code.

mapping_replace(code, value)[source]

Replace a single integer code with a single value.

classmethod newclassfrominstance(instance, origin)[source]

Used when the FastArray portion of the Categorical is updated, but not the rest of the class attributes.

Examples

>>> c=rt.Cat(['a','b','c'])
>>> rt.Cat.newclassfrominstance(c._fa[1:2],c)
Categorical([b]) Length: 1
  FastArray([2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
notna(*args, **kwargs)[source]

See Categorical.isnotnan.

nth(arr, n=1, transform=None, filter=None, showfilter=None)[source]

Select the nth row from each group.

Parameters:
  • arr (array or list of array) – The array of values to select from.

  • n (int) – A single nth value for the row.

  • transform (bool) – If True, the output will have the same shape as arr. If False, the output will typically have the same shape as the Categorical.

  • filter (array of bool, optional) – Elements to include in the operation.

  • showfilter (bool) – If True, the output contains an extra row representing the operation applied to a stack of all the elements that were filtered out (both at Categorical creation and in this operation, using a filter.)

Examples

>>> ds = rt.Dataset({'A': rt.Categorical(['a', 'a', 'b', 'a', 'b']),
...                  'B': [rt.nan, 2, 3, 4, 5]})
>>> c = ds.A
>>> c.nth([ds.A, ds.B], 0)
*A      B
--   ----
a     nan
b    3.00

[2 rows x 2 columns] total bytes: 18.0 B
>>> c.nth([ds.A, ds.B], 1)
*A     B
--  ----
a   2.00
b   5.00

[2 rows x 2 columns] total bytes: 18.0 B
>>> c.nth([ds.A, ds.B], -1)
*A      B
--   ----
a    4.00
b    5.00

[2 rows x 2 columns] total bytes: 18.0 B
>>> c.nth(ds.B, -2, transform=True)
#      B
-   ----
0   2.00
1   2.00
2   3.00
3   2.00
4   3.00

[5 rows x 1 columns] total bytes: 40.0 B
>>> c.nth(ds.B, 1, filter=ds.B.isnotnan())
*A      B
--   ----
a    4.00
b    5.00

[2 rows x 2 columns] total bytes: 18.0 B
>>> c.nth(ds.B, -2, filter=ds.A!='b', showfilter=True)
*A            B
--------   ----
Filtered   3.00
a          2.00
b           nan

[3 rows x 2 columns] total bytes: 48.0 B
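
As a rough mental model of the selection, the pure-Python sketch below picks the nth element of each group. It is an illustration under simplifying assumptions (no filter, no Dataset output); `group_nth` is a hypothetical helper, not a Riptable function.

```python
import math
from collections import defaultdict


def group_nth(keys, values, n):
    """Return {key: nth value within that key's group}, or None if the group is too short."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    result = {}
    for key, members in groups.items():
        in_range = -len(members) <= n < len(members)
        result[key] = members[n] if in_range else None
    return result


keys = ["a", "a", "b", "a", "b"]
values = [math.nan, 2.0, 3.0, 4.0, 5.0]

print(group_nth(keys, values, 1))   # {'a': 2.0, 'b': 5.0}
print(group_nth(keys, values, -1))  # {'a': 4.0, 'b': 5.0}
```

Negative values of n count from the end of each group, as in the `-1` and `-2` examples above.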
numba_apply(userfunc, *args, filter=None, transform=False, **kwargs)[source]

Applies a user-supplied numba function over the groups of a Categorical. The numba function should return either a scalar or an np.array the size of its input array. If it returns a scalar, set transform=True to re-expand the result to the size of the Categorical.

Parameters:
  • userfunc (numba function) – The function to apply to each group.

  • args (np.array) – Input passed to userfunc, which must return a scalar or an np.array of the same length.

  • filter (boolean array) – Boolean filter.

  • kwargs – Keyword arguments to pass to userfunc.

  • transform (bool) – Set to True if userfunc returns a scalar but you want the result re-expanded to the size of the original array.

Return type:

Dataset with categorical keys for scalar function with transform = False, otherwise aligned to original categorical
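
The scalar versus array return convention can be illustrated without numba. The following is a pure-Python sketch of the idea only; `group_apply` is a hypothetical stand-in, it uses plain lists instead of arrays, and it returns a dict or list rather than a Dataset.

```python
def group_apply(keys, values, func, transform=False):
    """Apply func to each group's values.

    A scalar result is kept per group, or broadcast back when transform=True.
    A list result is scattered back to the original positions.
    """
    positions = {}
    for i, key in enumerate(keys):
        positions.setdefault(key, []).append(i)

    per_group = {}
    expanded = [None] * len(keys)
    returned_list = False
    for key, idx in positions.items():
        result = func([values[i] for i in idx])
        if isinstance(result, list):
            returned_list = True
            for i, item in zip(idx, result):
                expanded[i] = item
        else:
            per_group[key] = result
            for i in idx:
                expanded[i] = result
    return expanded if (returned_list or transform) else per_group


keys = ["a", "a", "b"]
values = [1, 2, 10]
print(group_apply(keys, values, sum))                  # {'a': 3, 'b': 10}
print(group_apply(keys, values, sum, transform=True))  # [3, 3, 10]
```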

nunique()[source]

Number of unique values that occur in the Categorical. Does not include invalids. Not the same as the length of possible uniques.

Categoricals based on dictionary mapping / enum will return unique count including all possibly invalid values from underlying array.

one_hot_encode(dtype=None, categories=None, return_labels=True)[source]

Generate one hot encoded arrays from each unique category.

Parameters:
  • dtype (data-type, optional) – The numpy data type to use for the one-hot encoded arrays. If dtype is not specified (i.e. is None), the encoded arrays will default to using a np.float32 representation.

  • categories (list or array-like, optional) – List or array containing unique category values to one-hot encode. Specify this when you only want to encode a subset of the unique category values. Defaults to None, in which case all categories are encoded.

  • return_labels (bool) – Not implemented.

Returns:

  • col_names (FastArray) – FastArray of column names (unique categories as unicode strings)

  • encoded_arrays (list of FastArray) – list of one-hot encoded arrays for each category

Notes

Unicode is used because the column names are often going to a dataset.

Performance warning: with a large number of unique categories, an array is generated for every one of them.

Examples

Default:

>>> c = rt.Categorical(rt.FA(['a','a','b','c','a']))
>>> c.one_hot_encode()
(FastArray(['a', 'b', 'c'], dtype='<U1'),
 [FastArray([1., 1., 0., 0., 1.], dtype=float32),
  FastArray([0., 0., 1., 0., 0.], dtype=float32),
  FastArray([0., 0., 0., 1., 0.], dtype=float32)])

Custom dtype:

>>> c.one_hot_encode(dtype=np.int8)
(FastArray(['a', 'b', 'c'], dtype='<U1'),
 [FastArray([1, 1, 0, 0, 1], dtype=int8),
  FastArray([0, 0, 1, 0, 0], dtype=int8),
  FastArray([0, 0, 0, 1, 0], dtype=int8)])

Specific categories:

>>> c.one_hot_encode(categories=['a','b'])
(FastArray(['a', 'b'], dtype='<U1'),
 [FastArray([ True,  True, False, False,  True]),
  FastArray([False, False,  True, False, False])])

Multikey:

>>> #NOTE: The double-quotes in the category names are not part of the actual string.
>>> c = rt.Categorical([rt.FA(['a','a','b','c','a']), rt.FA([1, 1, 2, 3, 1]) ] )
>>> c.one_hot_encode()
(FastArray(["('a', '1')", "('b', '2')", "('c', '3')"], dtype='<U10'),
 [FastArray([1., 1., 0., 0., 1.], dtype=float32),
  FastArray([0., 0., 1., 0., 0.], dtype=float32),
  FastArray([0., 0., 0., 1., 0.], dtype=float32)])

Mapping:

>>> c = rt.Categorical(rt.arange(3), {'a':0, 'b':1, 'c':2})
>>> c.one_hot_encode()
(FastArray(['a', 'b', 'c'], dtype='<U1'),
 [FastArray([1., 0., 0.], dtype=float32),
  FastArray([0., 1., 0.], dtype=float32),
  FastArray([0., 0., 1.], dtype=float32)])
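
For readers unfamiliar with the encoding itself, here is a minimal pure-Python sketch of one-hot encoding. It is not Riptable's implementation; `one_hot` is a hypothetical helper that works on plain lists and assumes sorted unique values when no categories are given.

```python
def one_hot(values, categories=None):
    """Return (labels, arrays) with one 0.0/1.0 list per category."""
    if categories is None:
        categories = sorted(set(values))
    arrays = [[1.0 if value == cat else 0.0 for value in values] for cat in categories]
    return list(categories), arrays


labels, arrays = one_hot(["a", "a", "b", "c", "a"])
print(labels)  # ['a', 'b', 'c']
for array in arrays:
    print(array)
# [1.0, 1.0, 0.0, 0.0, 1.0]
# [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 1.0, 0.0]
```

One list is built per category, so memory grows with the number of categories times the array length.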
set_name(name)[source]

If the grouping dict contains a single item, rename it.

See also

Grouping.set_name, FastArray.set_name

set_valid(filter=None)[source]

Apply a filter to the categorical’s values. If values no longer occur in the uniques, the uniques will be reduced, and the index will be recalculated.

Parameters:

filter (boolean array, optional) – If provided, must be the same size as the categorical’s underlying array. Will be used to mask non-unique values. If not provided, categorical may still reduce its unique values to the unique occurring values.

Returns:

c – New categorical with possibly reduced uniques.

Return type:

Categorical
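
One plausible reading of the description is sketched below in plain Python: masked-out elements receive the Filtered code 0, categories that no longer occur are dropped, and the remaining codes are renumbered. This is an assumption-laden illustration (base index 1, hypothetical `set_valid_sketch` name), not the library's code.

```python
def set_valid_sketch(codes, categories, mask=None):
    """Return (new_codes, new_categories) after masking and dropping unused categories."""
    if mask is None:
        mask = [True] * len(codes)
    # Codes that still occur among the kept, non-Filtered elements.
    kept = sorted({code for code, keep in zip(codes, mask) if keep and code != 0})
    remap = {old: new for new, old in enumerate(kept, start=1)}
    new_codes = [remap[code] if keep and code != 0 else 0 for code, keep in zip(codes, mask)]
    new_categories = [categories[old - 1] for old in kept]
    return new_codes, new_categories


codes = [2, 1, 3, 2, 3]              # stands for ['b', 'a', 'c', 'b', 'c']
categories = ["a", "b", "c"]
mask = [True, False, True, True, True]
print(set_valid_sketch(codes, categories, mask))
# ([1, 0, 2, 1, 2], ['b', 'c'])
```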

shift(arr, window=None, *, periods=None, filter=None)[source]

Shift values in each group by the specified number of periods.

Where the shift introduces a missing value, the missing value is filled with the invalid value for the array’s data type (for example, NaN for floating-point arrays or the sentinel value for integer arrays).

Parameters:
  • arr (array or list of array) – The array of values to shift.

  • window (int, default 1) – The number of periods to shift. Can be a negative number to shift values backward.

  • periods (int, optional, default 1) – Can use periods instead of window for Pandas parameter support.

  • filter (FastArray of bool, optional) – Set of rows to include. Filtered out rows are skipped by the shift and become NaN in the output.

Returns:

A Dataset containing a column of shifted values.

Return type:

Dataset

See also

Categorical.shift_cat

Shift the values of a Categorical.

FastArray.shift

Shift the values of a FastArray.

DateTimeNano.shift

Shift the values of a DateTimeNano array.

Examples

With the default window=1:

>>> c = rt.Cat(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])
>>> fa = rt.arange(9)
>>> shift_val = c.shift(fa)
>>> shift_val
#   col_0
-   -----
0     Inv
1       0
2       1
3     Inv
4       3
5       4
6     Inv
7       6
8       7

With window=2:

>>> shift_val_2 = c.shift(fa, window=2)
>>> shift_val_2
#   col_0
-   -----
0     Inv
1     Inv
2       0
3     Inv
4     Inv
5       3
6     Inv
7     Inv
8       6

With window=-1:

>>> shift_neg = c.shift(fa, window=-1)
>>> shift_neg
#   col_0
-   -----
0       1
1       2
2     Inv
3       4
4       5
5     Inv
6       7
7       8
8     Inv

With filter:

>>> filt = rt.FA([True, True, True, True, False, True, False, True, True])
>>> shift_filt = c.shift(fa, filter=filt)
>>> shift_filt
#   col_0
-   -----
0     Inv
1       0
2       1
3     Inv
4     Inv
5       3
6     Inv
7     Inv
8       7

Results put in a Dataset to show the shifts in relation to the categories:

>>> ds = rt.Dataset()
>>> ds.c = c
>>> ds.shift_val = shift_val
>>> ds.shift_val_2 = shift_val_2
>>> ds.shift_neg = shift_neg
>>> ds
#   c   shift_val   shift_val_2   shift_neg
-   -   ---------   -----------   ---------
0   a         Inv           Inv           1
1   a           0           Inv           2
2   a           1             0         Inv
3   b         Inv           Inv           4
4   b           3           Inv           5
5   b           4             3         Inv
6   c         Inv           Inv           7
7   c           6           Inv           8
8   c           7             6         Inv

Shift two arrays:

>>> fa2 = rt.arange(10, 19)
>>> shift_val_3 = c.shift([fa, fa2])
>>> shift_val_3
#   col_0   col_1
-   -----   -----
0     Inv     Inv
1       0      10
2       1      11
3     Inv     Inv
4       3      13
5       4      14
6     Inv     Inv
7       6      16
8       7      17
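
The per-group shift can be modelled in a few lines of plain Python. The sketch below is only an illustration of the semantics shown in the examples; `group_shift` is a hypothetical helper, and `None` stands in for the `Inv` value.

```python
def group_shift(keys, values, window=1, mask=None):
    """Shift values within each group; None marks positions with no source value."""
    if mask is None:
        mask = [True] * len(keys)
    positions = {}
    for i, (key, keep) in enumerate(zip(keys, mask)):
        if keep:  # filtered-out rows are skipped and stay None in the output
            positions.setdefault(key, []).append(i)
    out = [None] * len(keys)
    for idx in positions.values():
        for rank, i in enumerate(idx):
            source = rank - window
            if 0 <= source < len(idx):
                out[i] = values[idx[source]]
    return out


keys = list("aaabbbccc")
values = list(range(9))
print(group_shift(keys, values))             # [None, 0, 1, None, 3, 4, None, 6, 7]
print(group_shift(keys, values, window=-1))  # [1, 2, None, 4, 5, None, 7, 8, None]
```

Because positions are tracked per group, a value never crosses from one category into the next.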
shift_cat(periods=1)[source]

See FastArray.shift(). Unlike shift on a FastArray, which fills with NaN or sentinel values, the invalid category appears in the shifted-in positions. Returns a new Categorical.

Examples

>>> rt.Cat(['a','b','c']).shift_cat(1)
Categorical([Filtered, a, b]) Length: 3
  FastArray([0, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
shrink(newcats, misc=None, inplace=False)[source]
Parameters:
  • newcats (array-like) – New categories to replace the old - typically a reduced set.

  • misc (scalar, optional (often a string)) – Value to use as category for items not found in new categories. This will be added to the new categories. If not provided, all items not found will be set to a filtered bin.

  • inplace (bool) – If True, re-index the categorical’s underlying FastArray. Otherwise, return a new categorical with a new index and grouping object.

Returns:

A new Categorical with the new index.

Return type:

Categorical

Examples

Base index 1, no misc

>>> c = rt.Categorical([1,2,3,1,2,3,0], ['a','b','c'])
>>> c.shrink(['b','c'])
Categorical([Filtered, b, c, Filtered, b, c, Filtered]) Length: 7
  FastArray([0, 1, 2, 0, 1, 2, 0]) Base Index: 1
  FastArray([b'b', b'c'], dtype='|S1') Unique count: 2

Base index 1, filtered bins and misc

>>> c.shrink(['b','c'], 'AAA').sum(rt.arange(7), showfilter=True)
*key_0     col_0
--------   -----
Filtered       6
AAA            3
b              5
c              7

Base index 0, with misc

>>> c = rt.Categorical([0,1,2,0,1,2], ['a','b','c'], base_index=0)
>>> c.shrink(['b','c'], 'AAA')
Categorical([AAA, b, c, AAA, b, c]) Length: 6
  FastArray([0, 1, 2, 0, 1, 2], dtype=int8) Base Index: 0
  FastArray(['AAA', 'b', 'c'], dtype='<U3') Unique count: 3

See also

Categorical.map
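
The re-indexing performed when shrinking can be sketched in plain Python. This is a simplified model for base index 1 only, with a hypothetical `shrink_sketch` name; placing the misc category first mirrors the example output above and is otherwise an assumption.

```python
def shrink_sketch(codes, categories, newcats, misc=None):
    """Return (new_codes, new_categories); unmatched items go to misc or to Filtered (0)."""
    new_categories = ([misc] if misc is not None else []) + list(newcats)
    lookup = {cat: i for i, cat in enumerate(new_categories, start=1)}
    fallback = lookup[misc] if misc is not None else 0
    new_codes = []
    for code in codes:
        if code == 0:
            new_codes.append(0)  # already Filtered, stays Filtered
        else:
            new_codes.append(lookup.get(categories[code - 1], fallback))
    return new_codes, new_categories


codes = [1, 2, 3, 1, 2, 3, 0]
categories = ["a", "b", "c"]
print(shrink_sketch(codes, categories, ["b", "c"]))
# ([0, 1, 2, 0, 1, 2, 0], ['b', 'c'])
print(shrink_sketch(codes, categories, ["b", "c"], misc="AAA"))
# ([1, 2, 3, 1, 2, 3, 0], ['AAA', 'b', 'c'])
```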

str()

Casts an array of byte strings or unicode as FAString.

Enables a variety of useful string manipulation methods.

Return type:

FAString

Raises:

TypeError – If the FastArray is of dtype other than byte string or unicode

See also

np.chararray, np.char, rt.FAString.apply

Examples

>>> s=rt.FA(['this','that','test ']*100_000)
>>> s.str.upper
FastArray([b'THIS', b'THAT', b'TEST ', ..., b'THIS', b'THAT', b'TEST '],
          dtype='|S5')
>>> s.str.lower
FastArray([b'this', b'that', b'test ', ..., b'this', b'that', b'test '],
          dtype='|S5')
>>> s.str.removetrailing()
FastArray([b'this', b'that', b'test', ..., b'this', b'that', b'test'],
          dtype='|S5')
to_arrow(type=None, *, preserve_fixed_bytes=False, empty_strings_to_null=True)[source]

Convert this Categorical to a pyarrow.Array.

Parameters:
  • type (pyarrow.DataType, optional, defaults to None) – Unused.

  • preserve_fixed_bytes (bool, optional, defaults to False) – Unused.

  • empty_strings_to_null (bool, optional, defaults to True) – Unused.

Return type:

pyarrow.Array or pyarrow.ChunkedArray

Notes

TODO: Consider whether we should store all Categoricals as Struct-type pyarrow arrays, since that’d allow us to preserve the key names, even for single-key Categoricals.

unlock()[source]

Unlocks the categories so new categories can be added, or existing categories can be removed or changed.

class riptable.rt_categorical.Categories(*args, base_index=1, invalid_category=None, ordered=False, unicode=False, _from_categorical=False, **kwargs)[source]

Holds categories for each Categorical instance. This adds a layer of abstraction to Categorical.

Categories objects are constructed in Categorical’s constructor and other internal routines such as merging operations. The Categories object is responsible for translating the values in the Categorical’s underlying fast array into the correct bin in the categories. It performs different operations to retrieve the correct bins based on its mode.

Parameters:
  • categories – main categories data - can also be empty list

  • invalid_category (str) – string that will be displayed for an invalid index

  • invalid_index – sentinel value for a particular index; this invalid will be displayed differently in IntEnum/Dictionary modes

  • ordered (bool) – flag for list modes; ordered categories can use a binary search for finding bins

  • auto_add_categories – if a setitem (bracket-indexing with a value) is called, and the value is not in the categories, this flag allows it to be added automatically.

  • na_added – for some constructors, the calling Categorical has already added the invalid category

  • base_index – the calling Categorical passes in the index offset for list and grouping modes

  • multikey – the categories information is stored in a multikey dictionary (up for deletion)

  • groupby – possibly merge with the multikey flag

Notes

There are multiple modes in which a Categories object can operate.

StringArray: (list_modes) Two paths for initializations use the categories routines: TB Filled in LATER array and list of unique categories. String mode will be set to unicode or bytes so the correct encoding/decoding can be performed before comparison/searching operations.

  • from list of strings (unique/ismember)

  • from list of strings paired with unique string categories (unique/ismember)

  • from codes paired with unique string categories (assignment will happen without unique/ismember)

  • from pandas categoricals (with string categories) (assignment will happen without unique/ismember)

  • from matlab categoricals (with string categories) (assignment will happen without unique/ismember)

NumericArray: (list_modes) This is not currently implemented as default behavior, but if enabled it will handle these constructors:

  • from list of integers

  • from list of floats

  • from codes paired with unique integer categories

  • from codes paired with unique float categories

  • from list of floats paired with unique float categories

  • from pandas categoricals with numeric categories

IntEnum / Dictionary: (dict_modes) Two dictionaries will be held: one mapping strings to integers, another mapping integers to strings. This mode requires that all strings and their corresponding codes are one-to-one.

  • from codes paired with IntEnum object

  • from codes paired with Integer -> String dictionary

  • from codes paired with String -> Integer dictionary not implemented

Grouping: All Categories objects in Grouping mode hold categories in a dictionary, even if the dictionary only contains one item. Information for indexed items will appear in a tuple if multiple columns are being held.

  • from list of key columns

  • from dictionary of key columns

  • from single list of numeric type

  • from dataset not implemented

property _first_list

Returns the first column when categories are in a dictionary, or the list if the categories are in a list mode.

property base_index
property grouping
property int2strdict
property isbytes

True if uniques are held in single array of bytes. Otherwise False.

property isenum

True if uniques have an enum / dictionary mapping for uniques. Otherwise False.

See also: GroupingEnum

property ismultikey

True if unique dict holds multiple arrays. False if unique dict holds single array or in enum mode.

property issinglekey

True if unique dict holds single array. False if unique dict holds multiple arrays or in enum mode.

property isunicode

True if uniques are held in single array of unicode. Otherwise False.

property mode
property name: str
property ncols: int

Returns the number of key columns in a multikey categorical or 1 if a single key’s categories are being held in a dictionary.

property nrows: int

Returns the number of unique categories in a multikey categorical.

property str2intdict
property uniquedict
property uniquelist
_grouping: riptable.rt_grouping.Grouping
default_colname = 'key_0'
dict_modes
list_modes
multikey_spacer = ' '
numeric_modes
string_modes
__getitem__(value)[source]
__len__()[source]

TODO: Consider changing length of enum/dict mode categories to be the length of the dictionary. Using max int so the calling Categorical can properly recast the integer array.

__repr__()[source]

Return repr(self).

__str__()[source]

Return str(self).

_array_edit(value, new_value=None, how='add')[source]
_build_string()[source]
_copy(deep=True)[source]

Creates a new categories object and possibly performs a deep copy of category list. Currently only supports Categories in list modes.

_get_array()[source]
_get_codes()[source]
_get_dict()[source]
_get_mapping()[source]
_getitem_enum(value)[source]

At this point, the categorical’s underlying fast array’s __getitem__ has already been hit. It will only execute if the return value was scalar. There is no need to handle lists, arrays, etc., which take a different path in Categorical.__getitem__.

The value should always be a single integer.

This will return a single item or list of items from an int/string index. Enums will always return an array of values, even if there is only one entry. Enum dictionaries can only be looked up with unicode strings, so bytes will be converted.

_getitem_multikey(value)[source]
_getitem_singlekey(value)[source]
_is_valid_mapping_code(value)[source]
_mapping_edit(code, value=None, how='add')[source]
_mapping_new(mapping)[source]
_possibly_add_categories(new_categories)[source]

Add non-existing categories to categories. If categories were added, an array is returned to fix the old indexes. If no categories were added, returns None.

classmethod build_dicts_enum(enum)[source]

Builds forward/backward dictionaries from IntEnums. If there are multiple identifiers with the same value, warn.
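
A forward/backward dictionary pair can be built from an IntEnum with the standard library alone. The sketch below is one reading of the description, not the method's actual code: `build_dicts_from_enum` and the `Color` enum are hypothetical, and skipping the duplicate after warning is a choice made here to keep the mapping one-to-one.

```python
import warnings
from enum import IntEnum


def build_dicts_from_enum(enum_cls):
    """Return (str2int, int2str) for an IntEnum, warning when two names share a code."""
    str2int = {}
    int2str = {}
    # __members__ includes aliases, which plain iteration over the enum would skip.
    for name, member in enum_cls.__members__.items():
        code = int(member)
        if code in int2str:
            warnings.warn(f"{name!r} and {int2str[code]!r} share the code {code}")
            continue
        str2int[name] = code
        int2str[code] = name
    return str2int, int2str


class Color(IntEnum):
    RED = 1
    GREEN = 2
    CRIMSON = 1  # alias of RED, triggers the warning


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    forward, backward = build_dicts_from_enum(Color)
print(forward)   # {'RED': 1, 'GREEN': 2}
print(backward)  # {1: 'RED', 2: 'GREEN'}
```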

classmethod build_dicts_python(python_dict)[source]

Categoricals can be initialized with a dictionary of string to integer or integer to string. Python dictionaries accept multiple types for their keys, so the dictionaries need to check types as they’re being constructed.

categories_as_dict()[source]

Groupby keys can be prepared for the calling Categorical.

copy(deep=True)[source]

Wrapper for internal _copy.

classmethod from_grouping(grouping, invalid_category=None)[source]
get_categories()[source]

TODO: Decide what to return for int enum categories. For now, returns a list of category strings.

get_category_index(s)[source]

Returns an integer or float for logical comparisons with the Categorical’s index array. A floating-point return ensures that LTE/GTE functions work properly.

get_category_match_index(fld)[source]

Returns the indices of matching strings in the unique list. The Categorical instance will compare these integers to those in its underlying array to generate a boolean mask.

get_multikey_index(multikey)[source]

Multikey categoricals can be indexed by tuple. This is an internal routine for getitem, setitem, and logical comparisons. Valid return will be adjusted for the base index of the categorical (currently always 1 for multikey)

Parameters:

multikey (tuple of items to search for in multiple columns) –

Returns:

location of multikey + base index, or -1 if not found

Return type:

int

Examples

>>> c = rt.Categorical([rt.arange(5), rt.arange(5)])
>>> c
Categorical([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]) Length: 5
  FastArray([1, 2, 3, 4, 5], dtype=int8) Base Index: 1
  {'key_0': FastArray([0, 1, 2, 3, 4]), 'key_1': FastArray([0, 1, 2, 3, 4])} Unique count: 5
>>> c._categories_wrap.get_multikey_index((0,0))
1
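
The lookup described above amounts to finding the row where every key column matches the tuple. A minimal pure-Python sketch follows; `multikey_index` is a hypothetical helper using a linear scan, which is an assumption made for clarity and says nothing about how Riptable performs the search.

```python
def multikey_index(key_columns, multikey, base_index=1):
    """Return the row of multikey in the unique key columns plus base_index, or -1."""
    target = tuple(multikey)
    for row, candidate in enumerate(zip(*key_columns)):
        if candidate == target:
            return row + base_index
    return -1


key_columns = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]  # key_0 and key_1 from the example
print(multikey_index(key_columns, (0, 0)))  # 1
print(multikey_index(key_columns, (0, 1)))  # -1
```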
match_str_to_category(fld)[source]

If necessary, convert the string or list of strings to the same type as the categories so that correct comparisons can be made.

possibly_invalid(value)[source]

If the calling categorical’s values are set to a bad index, !<badindex> will be returned. If the bad index is the sentinel value for that integer type, !<inv> will be returned.

riptable.rt_categorical.CatZero(values, categories=None, ordered=None, sort_gb=None, lex=None, base_index=0, **kwargs)[source]

Calls Categorical() with base_index keyword set to 0.

riptable.rt_categorical.categorical_convert(v, base_index=0)[source]
Parameters:

v (a pandas categorical) –

Returns:

The two building blocks needed to make an rt categorical (the integer array, and what it indexes into). Whatever the pandas categorical’s underlying object is, we try to convert it to a string to detach from object references and be free of pandas references. Pandas also uses -1 to indicate an out-of-bounds value; when we detect this, we insert an item at the beginning.

Examples

>>> p = pd.Categorical(['a','b','b','a','a','c','b','c','a','a'], categories=['a','b'])
>>> test = rt.Categorical(p)

From a cut:

>>> a = rt.FA(rt.arange(10.0) + .1)
>>> p = pd.cut(a, [0, 3, 6, 7])
[(0, 3], (0, 3], (0, 3], (3, 6], (3, 6], (3, 6], (6, 7], NaN, NaN, NaN]
>>> test = rt.Categorical(p)
Categorical([(0, 3], (0, 3], (0, 3], (3, 6], (3, 6], (3, 6], (6, 7], nan, nan, nan])
riptable.rt_categorical.categorical_merge_dict(list_categories, return_is_safe=False, return_type=Categorical)[source]

Checks to make sure all unique string values in all dictionaries have the same corresponding integer in every categorical they appear in. Checks to make sure all unique integer values in all dictionaries have the same corresponding string in every categorical they appear in.
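
The two-way consistency check described above can be sketched with plain dictionaries. This is an illustration of the idea only; `mappings_are_consistent` is a hypothetical helper operating on string -> integer dicts, not the function's real signature or return value.

```python
def mappings_are_consistent(mappings):
    """True if every string maps to one integer and every integer to one string across all dicts."""
    str2int = {}
    int2str = {}
    for mapping in mappings:
        for name, code in mapping.items():
            # setdefault records the first pairing and returns it on later sightings.
            if str2int.setdefault(name, code) != code:
                return False
            if int2str.setdefault(code, name) != name:
                return False
    return True


print(mappings_are_consistent([{"a": 1, "b": 2}, {"b": 2, "c": 3}]))  # True
print(mappings_are_consistent([{"a": 1}, {"a": 2}]))                  # False
print(mappings_are_consistent([{"a": 1}, {"b": 1}]))                  # False
```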