Riptable Categoricals – Filtering

Categoricals that use base-1 indexing can be filtered when they’re created or anytime afterwards. Filters can also be applied on a one-off basis at the time of an operation.

Values or entire categories can be filtered. Filtered items are mapped to 0 in the integer mapping array and omitted from operations.

On this page:

Filtering at Categorical creation

Provide a filter argument to filter values at Categorical creation. Filtered values are omitted from all operations on the Categorical.

Notes:

  • Only base-1 indexing is supported – the 0 is reserved for Filtered values.

  • You can’t use a dictionary or IntEnum to create a Categorical with a filter.

You can filter out certain values or an entire category:

>>> f = rt.FA([True, True, False, True, True, True, True])  # The mask must be an array, not a list.
>>> c = rt.Categorical(["a", "a", "b", "a", "c", "c", "b"], filter=f)  # One "b" value is filtered.
>>> c
Categorical([a, a, Filtered, a, c, c, b]) Length: 7
  FastArray([1, 1, 0, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

>>> c.count()
*key_0   Count
------   -----
a            3
b            1
c            2

In the example below, an entire category is filtered. If the Categorical is constructed from values without provided categories, categories that are entirely filtered out do not appear in the array of unique categories or in the results of operations:

>>> vals = rt.FA(["a", "a", "b", "a", "c", "c", "b"])
>>> f = (vals != "b")  # Filter out all "b" values.
>>> c = rt.Categorical(vals, filter=f)
>>> c
Categorical([a, a, Filtered, a, c, c, Filtered]) Length: 7
  FastArray([1, 1, 0, 1, 2, 2, 0], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2

>>> c.count()
*key_0   Count
------   -----
a            3
c            2

If categories are provided, entirely filtered-out categories do appear in the array of unique categories and the results of operations:

>>> c = rt.Categorical(vals, categories=["a", "b", "c"], filter=f)
>>> c
Categorical([a, a, Filtered, a, c, c, Filtered]) Length: 7
  FastArray([1, 1, 0, 1, 3, 3, 0], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

>>> c.count()
*key_0   Count
------   -----
a            3
b            0
c            2

Multi-key Categoricals can also be filtered at creation.

>>> f = rt.FA([False, False, True, False, True, True])
>>> vals1 = rt.FastArray(["a", "b", "b", "a", "b", "a"])
>>> vals2 = rt.FastArray([2, 1, 1, 3, 2, 1])
>>> rt.Categorical([vals1, vals2], filter=f)
Categorical([Filtered, Filtered, (b, 1), Filtered, (b, 2), (a, 1)]) Length: 6
  FastArray([0, 0, 1, 0, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'b', b'b', b'a'], dtype='|S1'), 'key_1': FastArray([1, 2, 1])} Unique count: 3

Categoricals using base-0 indexing can’t be filtered at creation:

>>> f = rt.FA([False, False, True, False, True, True, False])
>>> try:
...    rt.Categorical([0, 1, 1, 2, 2, 0, 1], base_index=0, filter=f)
... except ValueError as e:
...    print("ValueError:", e)
ValueError: Filtering is not allowed for base index 0. Use base-1 indexing instead.

Categoricals created using a dictionary or IntEnum can’t be filtered by passing a filter argument at creation, but a Filtered category can be included by by using the integer sentinel value as the Filtered mapping code. They can also be filtered after creation using set_valid().

Using the filter argument gets an error:

>>> f = rt.FA([True, False, False, False, False])
>>> d = {44: "StronglyAgree", 133: "Agree", 75: "Disagree", 1: "StronglyDisagree", 144: "NeitherAgreeNorDisagree" }
>>> codes = [1, 44, 144, 133, 75]
>>> try:
...    rt.Categorical(codes, categories=d, filter=f)
... except TypeError as e:
...    print("TypeError:", e)
TypeError: Grouping from enum does not support pre-filtering.

However, you can include a Filtered category by using the integer sentinel value in your mapping:

>>> d = {-2147483648: "Filtered", 44: "StronglyAgree", 133: "Agree", 75: "Disagree", 1: "StronglyDisagree", 144: "NeitherAgreeNorDisagree" }
>>> codes = [-2147483648, 44, 144, 133, 75]
>>> c = rt.Categorical(codes, categories=d)
>>> c
Categorical([Filtered, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([-2147483648,          44,         144,         133,          75]) Base Index: None
  {-2147483648:'Filtered', 44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5

>>> from enum import IntEnum
>>> class LikertDecision(IntEnum):
...     # A Likert scale with the typical five-level Likert item format.
...     Filtered = -2147483648
...     StronglyAgree = 44
...     Agree = 133
...     Disagree = 75
...     StronglyDisagree = 1
...     NeitherAgreeNorDisagree = 144
>>> codes = [-2147483648, 1, 44, 144, 133, 75]
>>> rt.Categorical(codes, categories=LikertDecision)
Categorical([Filtered, StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 6
  FastArray([-2147483648,           1,          44,         144,         133,          75]) Base Index: None
  {-2147483648:'Filtered', 44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 6

You can also filter an existing category after creation using set_valid (see below).

Filtering after Categorical creation

Calling set_valid on a Categorical returns a filtered copy of the Categorical.

>>> c = rt.Categorical(["a", "a", "b", "a", "c", "c", "b"])
>>> c
Categorical([a, a, b, a, c, c, b]) Length: 7
  FastArray([1, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> f = rt.FA([True, True, False, True, True, True, True])  # Filter out 1 "b" value.
>>> c.set_valid(f)
Categorical([a, a, Filtered, a, c, c, b]) Length: 7
  FastArray([1, 1, 0, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

The original Categorical isn’t modified:

>>> c
Categorical([a, a, b, a, c, c, b]) Length: 7
  FastArray([1, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Entirely filtered-out bins are removed from the array of unique categories:

>>> vals = rt.FA(["a", "a", "b", "a", "c", "c", "b"])
>>> f = (vals != "b")  # Filter out all "b" values.
>>> c.set_valid(f)
Categorical([a, a, Filtered, a, c, c, Filtered]) Length: 7
  FastArray([1, 1, 0, 1, 2, 2, 0], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2

A Categorical created with a mapping dictionary or IntEnum can be filtered after creation. Filtered values are mapped to the integer sentinel value:

>>> d = {44: "StronglyAgree", 133: "Agree", 75: "Disagree", 1: "StronglyDisagree", 144: "NeitherAgreeNorDisagree" }
>>> codes = [1, 44, 144, 133, 75]
>>> c = rt.Categorical(codes, categories=d)
>>> c
Categorical([StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([  1,  44, 144, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5
>>> f = rt.FA([False, True, True, True, True])  # Filter out 1: "StronglyDisagree".
>>> c.set_valid(f)
Categorical([Filtered, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([-2147483648,          44,         144,         133,          75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 144:'NeitherAgreeNorDisagree', -2147483648:'Filtered'} Unique count: 5

>>> class LikertDecision(IntEnum):
...     # A Likert scale with the typical five-level Likert item format.
...     StronglyAgree = 44
...     Agree = 133
...     Disagree = 75
...     StronglyDisagree = 1
...     NeitherAgreeNorDisagree = 144
>>> codes = [1, 44, 144, 133, 75]
>>> c = rt.Categorical(codes, categories=LikertDecision)
>>> c
Categorical([StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([  1,  44, 144, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5
>>> f = rt.FA([False, True, True, True, True])  # Filter out 1: "StronglyDisagree".
>>> c.set_valid(f)
Categorical([Filtered, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([-2147483648,          44,         144,         133,          75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 144:'NeitherAgreeNorDisagree', -2147483648:'Filtered'} Unique count: 5

Filtering can be useful to re-index a Categorical so only its occurring uniques are shown:

>>> f = (vals != "b")
>>> c2 = c[f]
>>> c2
Categorical([a, a, a, c, c]) Length: 5
  FastArray([1, 1, 1, 3, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

>>> c2.sum(rt.arange(5))
*key_0   col_0
------   -----
a            3
b            0
c            7

>>> # Use set_valid to create a re-indexed Categorical:.
>>> c3 = c2.set_valid()
>>> c3
Categorical([a, a, a, c, c]) Length: 5
  FastArray([1, 1, 1, 2, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2

>>> c3.count()
*key_0   Count
------   -----
a            3
c            2

>>> c3.sum(rt.arange(5))
*key_0   col_0
------   -----
a            3
c            7

Filter an operation on a Categorical

To filter one operation (such as a sum), use the filter argument for the operation. Filtered results are omitted, but any entirely filtered categories still appear in the results:

>>> # Put the Categorical in a Dataset to better see
>>> # the associated values used in the operation.
>>> ds = rt.Dataset()
>>> vals = rt.FA(["a", "a", "b", "a", "c", "c", "b"])
>>> c = rt.Categorical(vals)
>>> ds.cats = c
>>> ds.ints = rt.arange(7)
>>> ds
#   cats   ints
-   ----   ----
0   a         0
1   a         1
2   b         2
3   a         3
4   c         4
5   c         5
6   b         6

>>> f = rt.FA([True, True, False, True, True, True, True])  # One "b" value is filtered.
>>> c.sum(ints, filter=f)
*key_0   ints
------   ----
a           4
b           6
c           9

>>> f = (cats != "b")  # Filter out all "b" values.
>>> c.sum(ints, filter=f)
*key_0   ints
------   ----
a           4
b           0
c           9

The Categorical doesn’t retain the filter:

>>> c
Categorical([a, a, b, a, c, c, b]) Length: 7
  FastArray([1, 1, 2, 1, 3, 3, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

To see the results of the operation applied to all Filtered values (irrespective of their group), use the showfilter argument:

>>> # A "b" value (2) and a "c" value (5) are filtered.
>>> f = rt.FA([True, True, False, True, True, False, True])
>>> c.sum(ints, filter=f, showfilter=True)
*key_0     ints
--------   ----
Filtered      7
a             4
b             6
c             4

>>> f = (cats != "a")  # Filter out all "a" values.
>>> c.sum(ints, filter=f, showfilter=True)
*key_0     ints
--------   ----
Filtered      4
a             0
b             8
c             9

Set a name for filtered values

You can set a string for displaying filtered values using filtered_set_name:

>>> vals = rt.FA(["a", "a", "b", "a", "c", "c", "b"])
>>> f = (vals != "b")
>>> c = rt.Categorical(vals, filter=f)
>>> c.filtered_set_name("FNAME")
>>> c
Categorical([a, a, FNAME, a, c, c, FNAME]) Length: 7
  FastArray([1, 1, 0, 1, 2, 2, 0], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2

See the name set for filtered values

To see the string used when filtered values are displayed, use the filtered_name property:

>>> c.filtered_name
'FNAME'