Riptable Categoricals – Base Index

Categoricals default to base-1 indexing

The 0 index is reserved for Filtered values and categories:

>>> vals = rt.FA(["b", "a", "a", "c", "a", "b"])
>>> f = rt.FA([False, True, True, True, True, True])
>>> rt.Categorical(vals, filter=f)
Categorical([Filtered, a, a, c, a, b]) Length: 6
  FastArray([0, 1, 1, 3, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Note that “b” doesn’t appear in the array of unique categories because it’s entirely filtered out:

>>> f = (vals != "b")
>>> rt.Categorical(vals, filter=f)
Categorical([Filtered, a, a, c, a, Filtered]) Length: 6
  FastArray([0, 1, 1, 2, 1, 0], dtype=int8) Base Index: 1
  FastArray([b'a', b'c'], dtype='|S1') Unique count: 2

Provided indices are assumed to be base-1, with the 0 index indicating invalid values:

>>> cats = rt.FA(["a", "b", "c"])
>>> rt.Categorical([1, 0, 0, 2, 0, 1], categories=cats])
Categorical([a, Filtered, Filtered, b, Filtered, a]) Length: 6
  FastArray([1, 0, 0, 2, 0, 1]) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Matlab also reserves 0 for invalid values, so a Categorical created with from_matlab=True must have a base-1 index:

>>> rt.Categorical([0.0, 1.0, 1.0, 3.0, 1.0, 2.0], categories=cats, from_matlab=True)
Categorical([Filtered, a, a, c, a, b]) Length: 6
  FastArray([0, 1, 1, 3, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Same with a Categorical converted from Pandas:

>>> import pandas as pd
>>> pdc = pd.Categorical(["a", "a", "z", "b", "c"], categories=cats)
>>> pdc
['a', 'a', NaN, 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> rt.Categorical(pdc)
Categorical([a, a, Filtered, b, c]) Length: 5
  FastArray([1, 1, 0, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Multi-key Categorical:

>>> f = rt.FA([False, False, True, False, True, True])
>>> rt.Categorical([rt.FA(["b", "a", "a", "c", "a", "b"]), rt.arange(6)], filter=f)
Categorical([Filtered, Filtered, (a, 2), Filtered, (a, 4), (b, 5)]) Length: 6
  FastArray([0, 0, 1, 0, 2, 3], dtype=int8) Base Index: 1
  {'key_0': FastArray([b'a', b'a', b'b'], dtype='|S1'), 'key_1': FastArray([2, 4, 5])} Unique count: 3

Categoricals with no base index

Categoricals created from a mapping dictionary or IntEnum have no base index:

>>> # Integer to string mapping.
>>> d = {44: "StronglyAgree", 133: "Agree", 75: "Disagree", 1: "StronglyDisagree", 144: "NeitherAgreeNorDisagree" }
>>> codes = [1, 44, 144, 133, 75]
>>> rt.Categorical(codes, categories=d)
Categorical([StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([  1,  44, 144, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5

>>> # String to integer mapping.
>>> d = {"StronglyAgree": 44, "Agree": 133, "Disagree": 75, "StronglyDisagree": 1, "NeitherAgreeNorDisagree": 144 }
>>> codes = [1, 44, 144, 133, 75]
>>> rt.Categorical(codes, categories=d)
Categorical([StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([  1,  44, 144, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5

>>> from enum import IntEnum
>>> class LikertDecision(IntEnum):
...     # A Likert scale with the typical five-level Likert item format.
...     StronglyAgree = 44
...     Agree = 133
...     Disagree = 75
...     StronglyDisagree = 1
...     NeitherAgreeNorDisagree = 144
>>> codes = [1, 44, 144, 133, 75]
>>> rt.Categorical(codes, categories=LikertDecision)
Categorical([StronglyDisagree, StronglyAgree, NeitherAgreeNorDisagree, Agree, Disagree]) Length: 5
  FastArray([  1,  44, 144, 133,  75]) Base Index: None
  {44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 5

Note: Categoricals that have no base index can’t be filtered by passing a filter argument at creation, but they can be filtered by using the integer sentinel value as the Filtered mapping code. They can also be filtered after creation using set_valid. For examples, see Filters.

Some Categoricals can opt for base-0 indexing

Base-0 can be used if:

A mapping dictionary isn’t used. A Categorical created from a mapping dictionary does not have a base index.

A filter isn’t used at creation.

A Matlab or Pandas Categorical isn’t being converted. These both reserve 0 for invalid values.

>>> rt.Categorical(["b", "a", "a", "c", "a", "b"], base_index=0)
Categorical([b, a, a, c, a, b]) Length: 6
  FastArray([1, 0, 0, 2, 0, 1]) Base Index: 0
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

>>> rt.Categorical(["b", "a", "a", "c", "a", "b"], categories=cats, base_index=0)
Categorical([b, a, a, c, a, b]) Length: 6
  FastArray([1, 0, 0, 2, 0, 1], dtype=int8) Base Index: 0
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

>>> rt.Categorical([1, 0, 0, 2, 0, 1], categories=cats, base_index=0)
Categorical([b, a, a, c, a, b]) Length: 6
  FastArray([1, 0, 0, 2, 0, 1]) Base Index: 0
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Filtering at Categorical creation prevents base-0 indexing

>>> f = rt.FA([True, True, False, True, True, True])

>>> try:
...     rt.Categorical(["b", "a", "a", "c", "a", "b"], filter=f, base_index=0)
... except ValueError as e:
...    print("ValueError:", e)
ValueError: Filtering is not allowed for base index 0. Use base-1 indexing instead.

>>> try:
...     rt.Categorical(["b", "a", "a", "c", "a", "b"], categories=cats, filter=f, base_index=0)
... except ValueError as e:
...     print("ValueError:", e)
ValueError: Filtering is not allowed for base index 0. Use base-1 indexing instead.

>>> try:
...     rt.Categorical([1, 0, 0, 2, 0, 1], categories=cats, filter=f, base_index=0)
... except ValueError as e:
...     print("ValueError:", e)
ValueError: Filtering is not allowed for base index 0. Use base-1 indexing instead.

Categoricals created from Matlab or Pandas Categoricals can’t use base-0 indexing

Categoricals created from Matlab Categoricals must use a base-1 index in order to preserve invalid values (which are also indexed as 0 in Matlab):

>>> import pandas as pd
>>> pdc = pd.Categorical(["b", "a", "a", "c", "a", "b"])
>>> try:
...     rt.Categorical(pdc, base_index=0)
... except ValueError as e:
...     print("ValueError:", e)
ValueError: To preserve invalids, pandas categoricals must be 1-based.

>>> try:
...     rt.Categorical([2.0, 1.0, 1.0, 3.0, 1.0, 2.0], categories=cats, from_matlab=True, base_index=0)
... except ValueError as e:
...     print("ValueError:", e)
ValueError: Categoricals from matlab must have a base index of 1, got 0.