Riptable Categoricals – Invalid Categories
A category set to be invalid at Categorical creation is considered to be NaN in the sense
that isnan
returns True
for the category, but
it’s mapped to a valid index and not excluded from any operations on the Categorical.
To exclude values or categories from operations, use the filter
argument.
Note that this behavior differs from Previous invalid behavior.
Warning: If the invalid category isn’t in the provided list of unique categories and a filter is also provided at Categorical creation, the invalid category also becomes Filtered.
Categorical created from values (no user-provided categories)
Because it’s assigned to a regular bin, an invalid category is allowed for base-0 and base-1 indexing:
>>> c = rt.Categorical(["b", "a", "a", "Inv", "c", "a", "b"], invalid="Inv", base_index=0)
>>> c
Categorical([b, a, a, Inv, c, a, b]) Length: 7
FastArray([2, 1, 1, 0, 3, 1, 2]) Base Index: 0
FastArray([b'Inv', b'a', b'b', b'c'], dtype='|S3') Unique count: 4
>>> c.isnan()
FastArray([False, False, False, True, False, False, False])
>>> c = rt.Categorical(['b', 'a', 'Inv', 'a'], invalid='Inv')
>>> c
Categorical([b, a, Inv, a]) Length: 4
FastArray([3, 2, 1, 2], dtype=int8) Base Index: 1
FastArray([b'Inv', b'a', b'b'], dtype='|S3') Unique count: 3
>>> c.isnan()
FastArray([False, False, True, False])
Categorical created from values and user-provided categories
If an invalid category is specified, it must also be in the list of unique categories, otherwise an error is raised:
>>> # Included.
>>> c = rt.Categorical(["b", "a", "Inv", "a"], categories=["a", "b", "Inv"], invalid="Inv")
>>> c
Categorical([b, a, Inv, a]) Length: 4
FastArray([2, 1, 3, 1], dtype=int8) Base Index: 1
FastArray([b'a', b'b', b'Inv'], dtype='|S3') Unique count: 3
>>> # Not included.
>>> try:
... rt.Categorical(["b", "a", "Inv", "a"], categories=["a", "b"], invalid="Inv")
... except ValueError as e:
... print("ValueError:", e)
ValueError: Found values that were not in provided categories: [b'Inv']. The
user-supplied categories (second argument) must also contain the invalid item Inv.
For example: Categorical(['b','a','Inv','a'], ['a','b','Inv'], invalid='Inv')
Categorical created with a filter
Be careful when mixing invalid categories and filters.
If you filter an invalid category, it becomes Filtered and no longer invalid:
>>> c = rt.Categorical(["Inv", "a", "b", "a"], categories=["Inv", "a", "b"],
... filter=rt.FA([False, True, True, True]), invalid="Inv")
>>> c
Categorical([Filtered, a, b, a]) Length: 4
FastArray([0, 2, 3, 2], dtype=int8) Base Index: 1
FastArray([b'Inv', b'a', b'b'], dtype='|S3') Unique count: 3
>>> c.isnan()
FastArray([False, False, False, False])
Warning: If the invalid category isn’t included in the array of unique cagtegories and you also provide a filter, the invalid category also becomes Filtered even if it isn’t filtered out directly.
For comparison, here’s an example where the invalid category is included in the list of unique categories and a filter is provided. You get a warning that doesn’t apply in this case, and the filter is applied:
>>> c = rt.Categorical(["Inv", "a", "b", "a"], categories=["Inv", "a", "b"],
... filter=rt.FA([True, True, False, False]), invalid="Inv")
UserWarning: Invalid category was set to Inv. If not in provided categories, will
also appear as filtered. For example: print(Categorical(['a','a','b'], ['b'],
filter=FA([True, True, False]), invalid='a')) -> Filtered, Filtered, Filtered
The second two values are filtered:
>>> c
Categorical([Inv, a, Filtered, Filtered]) Length: 4
FastArray([1, 2, 0, 0], dtype=int8) Base Index: 1
FastArray([b'Inv', b'a', b'b'], dtype='|S3') Unique count: 3
And the invalid category is still invalid:
>>> c.isnan()
FastArray([ True, False, False, False])
However, when the invalid category is not included in the list of unique categories, the warning does apply, and the invalid category also becomes Filtered:
>>> c = rt.Categorical(["Inv", "a", "b", "a"], categories=["a", "b"],
... filter=rt.FA([True, True, False, False]), invalid="Inv")
UserWarning: Invalid category was set to Inv. If not in provided categories, will
also appear as filtered. For example: print(Categorical(['a','a','b'], ['b'],
filter=FA([True, True, False]), invalid='a')) -> Filtered, Filtered, Filtered
>>> c
Categorical([Filtered, a, Filtered, Filtered]) Length: 4
FastArray([0, 1, 0, 0], dtype=int8) Base Index: 1
FastArray([b'a', b'b'], dtype='|S1') Unique count: 2
And “Inv” is no longer considered an invalid category:
>>> c.isnan()
FastArray([False, False, False, False])
Invalid categories are not excluded from operations
Although invalid categories are recognized by the Categorical
isnan
method, they are not
excluded from operations as filtered values and categories are.
Here, “Inv” is invalid and the “b” category is filtered:
>>> vals = rt.FA(["Inv", "b", "a", "b", "c", "c", "Inv"])
>>> f = vals != "b"
>>> c = rt.Categorical(vals, invalid="Inv", filter=f)
>>> c
Categorical([Inv, Filtered, a, Filtered, c, c, Inv]) Length: 7
FastArray([1, 0, 2, 0, 3, 3, 1], dtype=int8) Base Index: 1
FastArray([b'Inv', b'a', b'c'], dtype='|S3') Unique count: 3
>>> c.isnan()
FastArray([ True, False, False, False, False, False, True])
Create some values to sum and put in a Dataset to see their relationsips to the catgegories:
>>> vals = rt.FA([1, 2, 3, 4, 5, 6, 7])
>>> ds = rt.Dataset({"c": c, "vals": vals})
>>> ds
# c vals
- -------- ----
0 Inv 1
1 Filtered 2
2 a 3
3 Filtered 4
4 c 5
5 c 6
6 Inv 7
Get the nansum
:
>>> c.nansum(vals)
*c vals
--- ----
Inv 8
a 3
c 11
The showfilter
argument confirms that only the “b” values were excluded:
>>> c.nansum(vals, showfilter=True)
*c vals
-------- ----
Filtered 6
Inv 8
a 3
c 11
If you use the filter
argument with nansum
and filter out an invalid, the
filtered invalid value is excluded from the operation:
>>> # Filter the first Inv, one of the already-filtered "b"s, and the first "c".
>>> f2 = rt.FA([False, False, True, True, True, False, True])
>>> c.nansum(vals, filter=f2, showfilter=True)
*key_0 col_0
-------- -----
Filtered 13
Inv 7
a 3
c 5
If both invalid values are filtered by the nansum
operation, the category still
appears in the result:
>>> f3 = rt.FA([False, False, True, True, True, False, False])
>>> c.nansum(vals, filter=f3)
*c vals
--- ----
Inv 0
a 3
c 5
And both invalid values are still invalid:
>>> c.isnan()
FastArray([ True, False, False, False, False, False, True])
Previous invalid behavior
Previously, the specified string was used to represent an invalid catgegory when values missing in the categories list were encountered. The invalid category was mapped to 0 in the index/codes array.
This is similar to how Pandas works, except that Pandas uses -1 for its NaN index:
>>> import pandas as pd
>>> pdc = pd.Categorical(["a", "a", "z", "b", "c"], ["a", "b", "c"])
>>> pdc
['a', 'a', NaN, 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pdc._codes
array([ 0, 0, -1, 1, 2], dtype=int8)
>>> pd.Series([1, 1, 1, 1, 1]).groupby(pdc).count()
a 2
b 1
c 1
dtype: int64