Riptable Categoricals – Sorting and Display Order
Whether a Categorical’s categories are lexicographically sorted or considered to be
“ordered” as specified at creation depends on two parameters: ordered
and lex
.
ordered
controls whether categories are sorted lexicographically before they are mapped to integers.lex
controls whether hashing- or sorting-based logic is used to find unique values in the input array. Note that iflex=True
, the categories become sorted even ifordered=False
.
Additionally, the results of groupby operations can be displayed in sorted order with
the sort_gb
parameter.
The way these parameters interact depends on whether categories are provided when the Categorical is created.
If categories are not provided, then if ordered=True
(the default) or lex=True
they are sorted in the Categorical and in groupby results, even if sort_gb=False
.
If ordered=False
and lex=False
, the categories are held in the order of first
appearance, and groupby results are sorted only if sort_gb=True
.
ordered |
lex |
categories sorted? |
groupby results sorted? |
---|---|---|---|
T |
T |
Y |
Y |
T |
F |
Y |
Y |
F |
T |
Y |
Y |
F |
F |
N |
only if sort_gb=True |
If categories are provided, they are always held in the same order. The ordered
argument is ignored, and lex
can’t be specified. Groupby results can be displayed
in sorted order with sort_gb=True
.
Categorical created from values (no user-provided categories)
With the default ordered=True
Categories are sorted.
Groupby results are sorted regardless of whether
sort_gb=True
.
>>> vals = rt.arange(6)
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"])
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([2, 1, 1, 3, 1, 2], dtype=int8) Base Index: 1
FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.sum(vals)
*key_0 col_0
------ -----
a 7
b 5
c 3
With ordered=False
Categories are not sorted unless
lex=True
. Settinglex=True
causes a lexicographical sort to be performed to find the uniques, and the Categorical’s categories become sorted even ifordered=False
.Groupby results are not displayed sorted unless
lex=True
orsort_gb=True
.
>>> vals = rt.arange(6)
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"], ordered=False)
>>> c
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([1, 2, 2, 3, 2, 1], dtype=int8) Base Index: 1
FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3
>>> c.sum(vals)
*key_0 col_0
------ -----
b 5
a 7
c 3
Here, lex=True
causes the categories to become sorted even though ordered=False
.
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"], ordered=False, lex=True)
>>> c
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([2, 1, 1, 3, 1, 2], dtype=int8) Base Index: 1
FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3
>>> c.sum(vals)
*key_0 col_0
------ -----
a 7
b 5
c 3
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"], ordered=False, sort_gb=True)
>>> c
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([1, 2, 2, 3, 2, 1], dtype=int8) Base Index: 1
FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3
>>> c.sum(vals)
*key_0 col_0
------ -----
a 7
b 5
c 3
Categorical created from values and user-provided categories (unsorted)
If categories are provided, they are always held in the same order. The
ordered
argument is ignored, and you can’t setlex=True
.Groupby results can be displayed in sorted order with
sort_gb=True
.
Categorical with unsorted categories:
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"], categories=["b", "a", "c"])
>>> c
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([1, 2, 2, 3, 2, 1], dtype=int8) Base Index: 1
FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3
Groupby results are in the order of provided categories:
>>> vals = rt.arange(6)
>>> c.sum(vals)
*key_0 col_0
------ -----
b 5
a 7
c 3
With provided categories, lex
can’t be set to True
:
>>> try:
... rt.Categorical(["b", "a", "a", "c", "a", "b"], categories=["b", "a", "c"], lex=True)
... except TypeError as e:
... print("TypeError:", e)
TypeError: Cannot bin using lexsort and user-suplied categories.
With sort_gb=True
, categories are held in the order provided but displayed
lexicographically sorted in groupby results:
>>> c = rt.Categorical(["b", "a", "a", "c", "a", "b"], categories=["b", "a", "c"], sort_gb=True)
>>> c
Categorical([b, a, a, c, a, b]) Length: 6
FastArray([1, 2, 2, 3, 2, 1], dtype=int8) Base Index: 1
FastArray([b'b', b'a', b'c'], dtype='|S1') Unique count: 3
>>> c.sum(vals)
*key_0 col_0
------ -----
a 7
b 5
c 3
If the categories are provided in a mapping dictionary or IntEnum
, the
groupby results are in the order of the underlying mapping codes array unless
sort_gb=True
:
>>> d = {"StronglyAgree": 44, "Agree": 133, "Disagree": 75, "StronglyDisagree": 1, "NeitherAgreeNorDisagree": 144 }
>>> codes = [1, 44, 44, 133, 75] # Note duplication and missing entry.
>>> c = rt.Categorical(codes, categories=d)
>>> c
Categorical([StronglyDisagree, StronglyAgree, StronglyAgree, Agree, Disagree]) Length: 5
FastArray([ 1, 44, 44, 133, 75]) Base Index: None
{44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 4
>>> vals = rt.arange(5)
>>> ds = rt.Dataset({"c": c, "vals": vals})
>>> ds
# c vals
- ---------------- ----
0 StronglyDisagree 0
1 StronglyAgree 1
2 StronglyAgree 2
3 Agree 3
4 Disagree 4
>>> c.sum(vals)
*c vals
--------------- ----
StronglyDisagre 0
StronglyAgree 3
Agree 3
Disagree 4
With sort_gb=True
, categories are displayed lexicographically sorted in groupby
results:
>>> c = rt.Categorical(codes, categories=d, sort_gb=True)
>>> c
Categorical([StronglyDisagree, StronglyAgree, StronglyAgree, Agree, Disagree]) Length: 5
FastArray([ 1, 44, 44, 133, 75]) Base Index: None
{44:'StronglyAgree', 133:'Agree', 75:'Disagree', 1:'StronglyDisagree', 144:'NeitherAgreeNorDisagree'} Unique count: 4
>>> c.sum(vals)
*key_0 vals
--------------- ----
Agree 3
Disagree 4
StronglyAgree 3
StronglyDisagre 0
Ordering of results from rt.cut and rt.qcut operations
With cut
and qcut
, when labels are provided they are held and displayed
in the order of first appearance and are considered ordered in the context of
logical comparisons:
>>> c = rt.cut(rt.arange(10), bins=3, labels=["z-label1", "y-label2", "x-label3"])
>>> c
Categorical([z-label1, z-label1, z-label1, z-label1, y-label2, y-label2, y-label2, x-label3, x-label3, x-label3]) Length: 10
FastArray([1, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int8) Base Index: 1
FastArray([b'z-label1', b'y-label2', b'x-label3'], dtype='|S8') Unique count: 3
>>> c.sum(rt.arange(10))
*key_0 col_0
-------- -----
z-label1 6
y-label2 15
x-label3 24
>>> c > "z-label1"
FastArray([False, False, False, False, True, True, True, True, True, True])
See Comparisons for more examples.