Perform Group Operations with Categoricals

Riptable Categoricals have two related uses:

  • They efficiently store string (or other large dtype) arrays that have repeated values. The repeated values are partitioned into groups (a.k.a. categories), and each group is mapped to an integer. For example, in a Categorical that contains three ‘AAPL’ symbols and four ‘MSFT’ symbols, the data is partitioned into an ‘AAPL’ group that’s mapped to 1 and a ‘MSFT’ group that’s mapped to 2. This integer mapping allows the data to be stored and operated on more efficiently.

  • They’re Riptable’s tool for doing group operations. A method applied to a Categorical is applied to each group separately.

We’ll talk about group operations first, then look at how Categoricals store data under the hood.

Here’s a simple Dataset with repeated stock symbols and some values:

>>> import riptable as rt
>>> import numpy as np  # Used later for random number generation.
>>> ds = rt.Dataset()
>>> ds.Symbol = rt.FA(['AAPL', 'MSFT', 'AAPL', 'TSLA', 'MSFT', 'TSLA'])
>>> ds.Value = rt.FA([5, 10, 15, 20, 25, 30])
>>> ds
#   Symbol   Value
-   ------   -----
0   AAPL         5
1   MSFT        10
2   AAPL        15
3   TSLA        20
4   MSFT        25
5   TSLA        30

Categoricals for Group Operations

We know how to get the sum of the Value column:

>>> ds.Value.sum()
105

Categoricals make it just as easy to get the sum for each symbol.

Use the Categorical constructor to turn the Symbol column into a Categorical:

>>> ds.Symbol = rt.Categorical(ds.Symbol)  # Note: rt.Cat() also works.

Now we call the sum() method on the Categorical, passing it the data we want to sum for each group:

>>> ds.Symbol.sum(ds.Value)
*Symbol   Value
-------   -----
AAPL         20
MSFT         35
TSLA         50

A Dataset is returned containing the groups from the Categorical and the result of the operation we called on each group.

Note the prepended ‘*’ in the Symbol column. This indicates that the column was used as the grouping variable in an operation.

Categoricals as Split, Apply, Combine Operations

Hadley Wickham, known for his work on R, described this kind of operation (also known as a “group by” operation) as split, apply, combine.

The illustration below shows how the groups are split based on the “keys” (or, in Riptable’s case, the Categorical values). The sum method is then applied to each group separately, and the results are combined into an output array.

[Illustration: the split-apply-combine operation]
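
If it helps to see the three steps spelled out, here’s a rough hand-rolled version of the sum above that uses plain Boolean masks. The Categorical does the equivalent work in one efficient, vectorized pass:

>>> for sym in ['AAPL', 'MSFT', 'TSLA']:
...     group = ds.Value[ds.Symbol == sym]  # Split: select one group's values.
...     print(sym, group.sum())             # Apply sum(), then combine the results.
AAPL 20
MSFT 35
TSLA 50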

Operations Supported by Categoricals

Categoricals support most common reducing functions, which return one value per group. Some of the more common ones:

Reducing Function   Description
-----------------   ----------------------------
count()             Total number of items
first(), last()     First item, last item
mean(), median()    Mean, median
min(), max()        Minimum, maximum
std(), var()        Standard deviation, variance
prod()              Product of all items
sum()               Sum of all items

Here’s the complete list of Categorical reducing functions.
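
For example, count() takes no data argument; it returns the number of rows in each group:

>>> ds.Symbol.count()
*Symbol   Count
-------   -----
AAPL          2
MSFT          2
TSLA          2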

Categoricals also support non-reducing functions. But because non-reducing functions return one value for each value in the original data, the results are a little different.

For example, take cumsum(), which computes a running total.

When it’s applied to a Categorical, the function does get applied to each group separately. However, the returned Dataset has one result per value of the original data:

>>> ds.Value2 = rt.FA([2, 10, 5, 25, 8, 20])
>>> ds.Symbol.cumsum(ds.Value2)
#   Value2
-   ------
0        2
1       10
2        7
3       25
4       18
5       45

The alignment of the result to the original data is easier to see if you add the results to the Dataset:

>>> ds.CumValue2 = ds.Symbol.cumsum(ds.Value2)
>>> # Sort to make the cumulative sum per group more clear, then display only the relevant columns.
>>> ds.sort_copy('Symbol').col_filter(['Symbol', 'Value2', 'CumValue2'])
#   Symbol   Value2   CumValue2
-   ------   ------   ---------
0   AAPL          2           2
1   AAPL          5           7
2   MSFT         10          10
3   MSFT          8          18
4   TSLA         25          25
5   TSLA         20          45

A commonly used non-reducing function is shift(). You can use it to compare values with shifted versions of themselves – for example, today’s price compared to yesterday’s price, the volume compared to the volume an hour ago, etc.

Where a category has no previous value to shift forward, the missing value is filled with an invalid value (e.g., Inv for integers or nan for floats):

>>> ds.PrevValue = ds.Symbol.shift(ds.Value)
>>> ds.col_filter(['Symbol', 'Value', 'PrevValue'])
#   Symbol   Value   PrevValue
-   ------   -----   ---------
0   AAPL         5         Inv
1   MSFT        10         Inv
2   AAPL        15           5
3   TSLA        20         Inv
4   MSFT        25          10
5   TSLA        30          20

Other non-reducing functions include rolling_sum(), rolling_mean() and their nan-versions rolling_nansum() and rolling_nanmean(), and cumsum() and cumprod().
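
For instance, here’s a sketch of a per-group rolling sum over two-value windows (we’re assuming window is the window-size argument; check the function’s signature):

>>> roll2 = ds.Symbol.rolling_sum(ds.Value2, window=2)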

Other functions not listed here can also be applied to Categoricals, including lambda functions and other user-defined functions, with the help of apply(). More on that below.

Expand the Results of Reducing Operations with transform

Notice that if we try to add the result of a reducing operation to a Dataset, Riptable complains that the result isn’t the right length:

>>> try:
...     ds.Mean = ds.Symbol.mean(ds.Value)
... except TypeError as e:
...     print("TypeError:", e)
TypeError: ('Row mismatch in Dataset._check_addtype.  Tried to add Dataset of different lengths', 6, 3)

You can expand the result of a reducing function so that it’s aligned with the original data by passing transform=True to the function:

>>> ds.MaxValue = ds.Symbol.max(ds.Value, transform=True)
>>> ds.sort_copy(['Symbol', 'Value']).col_filter(['Symbol', 'Value', 'MaxValue'])
#   Symbol   Value   MaxValue
-   ------   -----   --------
0   AAPL         5         15
1   AAPL        15         15
2   MSFT        10         25
3   MSFT        25         25
4   TSLA        20         30
5   TSLA        30         30

The max value per symbol is repeated for every instance of the symbol.

Apply an Operation to Multiple Columns or a Dataset

You can apply a function to multiple columns by passing a list of column names. Here’s a reducing function applied to two columns:

>>> ds.Value3 = ds.Value * 2  # Add another column of data.
>>> ds.Symbol.max([ds.Value, ds.Value3])
*Symbol   Value   Value3
-------   -----   ------
AAPL         15       30
MSFT         25       50
TSLA         30       60

Note the syntax for adding the results of an operation on two columns to a Dataset. To be the right length for the Dataset, the results have to be from a non-reducing function or a reducing function that has transform=True:

>>> ds[['MaxValue', 'MaxValue3']] = ds.Symbol.max([ds.Value, ds.Value3],
...                                               transform=True)[['Value', 'Value3']]
>>> ds.sort_copy('Symbol').col_filter(['Symbol', 'Value', 'Value3', 'MaxValue', 'MaxValue3'])
#   Symbol   Value   Value3   MaxValue   MaxValue3
-   ------   -----   ------   --------   ---------
0   AAPL         5       10         15          30
1   AAPL        15       30         15          30
2   MSFT        10       20         25          50
3   MSFT        25       50         25          50
4   TSLA        20       40         30          60
5   TSLA        30       60         30          60

You can also apply a function to a whole Dataset. Any column for which the function fails – for example, a numerical function on a string column – is not returned:

>>> ds.OptionType = list("PC")*3  # Add a string column.
>>> ds.Symbol.max(ds)
*Symbol   Value   Value2   CumValue2   PrevValue   MaxValue   Value3   MaxValue3
-------   -----   ------   ---------   ---------   --------   ------   ---------
AAPL         15        5           7           5         15       30          30
MSFT         25       10          18          10         25       50          50
TSLA         30       25          45          20         30       60          60

Categoricals for Storing Strings

To get a better sense of how Categoricals store data, let’s look at one under the hood:

>>> ds.Symbol
Categorical([AAPL, MSFT, AAPL, TSLA, MSFT, TSLA]) Length: 6
  FastArray([1, 2, 1, 3, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'AAPL', b'MSFT', b'TSLA'], dtype='|S4') Unique count: 3

The first line shows all 6 symbol values. You can access this expanded array with expand_array:

>>> ds.Symbol.expand_array
FastArray([b'AAPL', b'MSFT', b'AAPL', b'TSLA', b'MSFT', b'TSLA'],
          dtype='|S8')

The second line is a FastArray of integers, with one integer for each value of the Categorical; each integer indicates which category the value belongs to. It’s accessible with _fa:

>>> ds.Symbol._fa
FastArray([1, 2, 1, 3, 2, 3], dtype=int8)

The list of unique categories is shown in the third line. You can access the list with category_array:

>>> ds.Symbol.category_array
FastArray([b'AAPL', b'MSFT', b'TSLA'], dtype='|S4')

It’s the same thing we get if we do:

>>> ds.Symbol.unique()
FastArray([b'AAPL', b'MSFT', b'TSLA'], dtype='|S4')

We can get a better picture of the mapping by putting the integer FastArray into the Dataset:

>>> ds.Mapping = ds.Symbol._fa
>>> ds.col_filter(['Symbol', 'Mapping'])
#   Symbol   Mapping
-   ------   -------
0   AAPL           1
1   MSFT           2
2   AAPL           1
3   TSLA           3
4   MSFT           2
5   TSLA           3

Because it’s much more efficient to pass around integers than it is to pass around strings, it’s common for string data with repeated values to be stored using integer mapping.
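
You can see the savings even in this small example: the integer mapping stores one int8 per value, while the expanded strings take eight bytes per value:

>>> ds.Symbol._fa.nbytes  # The 6 values stored as int8.
6
>>> ds.Symbol.expand_array.nbytes  # The same 6 values as 8-byte strings.
48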

If you have data stored as integers (for example, datetime data), you can create a Categorical using the integer array and an array of unique categories:

>>> c = rt.Categorical([1, 3, 2, 2, 1, 3, 3, 1], categories=['a','b','c'])
>>> c
Categorical([a, c, b, b, a, c, c, a]) Length: 8
  FastArray([1, 3, 2, 2, 1, 3, 3, 1]) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

Notice that in this Categorical and the one we created above, the base index is 1, not 0.

This brings us to an important note about Categoricals: By default, the base index is 1; 0 is reserved for holding any values of the Categorical that are filtered out of operations on the Categorical.

Values can be filtered out of all operations or specific ones.

Filter Values or Categories from All Categorical Operations

When you create a Categorical, you can filter certain values or entire categories from all operations on it.

We’ll start with filtering values. Say we have a Dataset with symbols ‘A’ and ‘B’ that are associated with exchanges ‘X’, ‘Y’, and ‘Z’.

>>> # Create the Dataset.
>>> rng = np.random.default_rng(seed=42)
>>> N = 25
>>> symbol_exchange = rt.Dataset()
>>> symbol_exchange.Symbol = rt.FA(rng.choice(['A', 'B'], N))
>>> symbol_exchange.Exchange = rt.FA(rng.choice(['X', 'Y', 'Z'], N))
>>> symbol_exchange
 #   Symbol   Exchange
--   ------   --------
 0   B        Y
 1   A        X
 2   B        Z
 3   A        Y
 4   B        Y
 5   B        Y
 6   B        Y
 7   B        X
 8   A        Y
 9   B        Z
10   A        X
11   B        Y
12   B        X
13   A        Y
14   B        Y
15   B        Z
16   A        X
17   A        X
18   A        Y
19   A        Z
20   A        Z
21   B        X
22   B        X
23   A        Z
24   B        X

We want to make the Symbol column a Categorical, but we’re interested in only the symbol values that are associated with the ‘X’ exchange.

When we create the Categorical, we can use the filter keyword argument with a Boolean mask array that’s True for symbol values associated with the ‘X’ exchange:

>>> exchangeX = symbol_exchange.Exchange == 'X'  # Create a mask array.
>>> c_x = rt.Cat(symbol_exchange.Symbol, filter=exchangeX)

When we view the Categorical, we can see that symbol values associated with exchanges ‘Y’ and ‘Z’ are shown as ‘Filtered’, and the ‘Filtered’ values are mapped to the 0 index in the integer array:

>>> c_x
Categorical([Filtered, A, Filtered, Filtered, Filtered, ..., Filtered, B, B, Filtered, B]) Length: 25
  FastArray([0, 1, 0, 0, 0, ..., 0, 2, 2, 0, 2], dtype=int8) Base Index: 1
  FastArray([b'A', b'B'], dtype='|S1') Unique count: 2

To better see what’s filtered, we can add it to the Dataset:

>>> symbol_exchange.Filtered = c_x
>>> symbol_exchange
 #   Symbol   Exchange   Filtered
--   ------   --------   --------
 0   B        Y          Filtered
 1   A        X          A
 2   B        Z          Filtered
 3   A        Y          Filtered
 4   B        Y          Filtered
 5   B        Y          Filtered
 6   B        Y          Filtered
 7   B        X          B
 8   A        Y          Filtered
 9   B        Z          Filtered
10   A        X          A
11   B        Y          Filtered
12   B        X          B
13   A        Y          Filtered
14   B        Y          Filtered
15   B        Z          Filtered
16   A        X          A
17   A        X          A
18   A        Y          Filtered
19   A        Z          Filtered
20   A        Z          Filtered
21   B        X          B
22   B        X          B
23   A        Z          Filtered
24   B        X          B

Now, a group operation applied to the Categorical omits the filtered values:

>>> c_x.count()
*Symbol   Count
-------   -----
A             4
B             5

Filtering out an entire category (here, the ‘A’ symbol) is similar:

>>> f_A = symbol_exchange.Symbol != 'A'
>>> c_b = rt.Categorical(symbol_exchange.Symbol, filter=f_A)
>>> c_b
Categorical([B, Filtered, B, Filtered, B, ..., Filtered, B, B, Filtered, B]) Length: 25
  FastArray([1, 0, 1, 0, 1, ..., 0, 1, 1, 0, 1], dtype=int8) Base Index: 1
  FastArray([b'B'], dtype='|S1') Unique count: 1

The filtered category is entirely omitted from operations:

>>> c_b.count()
*Symbol   Count
-------   -----
B            14

If you’re creating a Categorical from integers and provided categories, another way to filter out a category is to map its values to 0.

Because 0 is reserved for the Filtered category, here ‘a’ is mapped to 1 and ‘b’ is mapped to 2. And because no integer code 3 appears to map to ‘c’, ‘c’ is effectively filtered out:

>>> c1 = rt.Categorical([0, 2, 1, 1, 0, 2, 2, 0], categories=['a','b','c'])
>>> c1
Categorical([Filtered, b, a, a, Filtered, b, b, Filtered]) Length: 8
  FastArray([0, 2, 1, 1, 0, 2, 2, 0]) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

In this case, the filtered category appears in the result, but it’s still omitted from calculations on the Categorical:

>>> c1.count()
*key_0   Count
------   -----
a            2
b            3
c            0

Note that the first column in the output is labeled ‘key_0’. This name was auto-generated because the Categorical had no explicit name. You can use the FastArray.set_name() method to assign a column name to the Categorical before doing any grouping operations. The Count column was created by the count() method.
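
Here’s a quick sketch (it assumes set_name() returns the renamed Categorical so that the result can be reassigned; ‘Letter’ is just an illustrative name):

>>> c1 = c1.set_name('Letter')
>>> c1.count()
*Letter   Count
-------   -----
a             2
b             3
c             0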

Filter Values or Categories from Certain Categorical Operations

It’s also possible to filter values for only a certain operation.

In Get and Operate on Subsets of Data Using Filters, we saw that many operations called on FastArrays / Dataset columns take a filter keyword argument that limits the data operated on:

>>> a = rt.FA([1, 2, 3, 4, 5])
>>> a.mean(filter=a > 2)
4.0

It’s similar with Categoricals:

>>> Symbol = rt.Cat(rt.FA(['AAPL', 'MSFT', 'AAPL', 'TSLA', 'MSFT', 'TSLA']))
>>> Value = rt.FA([5, 10, 15, 20, 25, 30])
>>> Symbol.mean(Value, filter=Value > 20.0)
*key_0   col_0
------   -----
AAPL       nan
MSFT     25.00
TSLA     30.00

The data that doesn’t meet the condition is omitted from the computation for only that operation.
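
Calling the same operation again without the filter confirms that the underlying Categorical is unchanged:

>>> Symbol.mean(Value)
*key_0   col_0
------   -----
AAPL     10.00
MSFT     17.50
TSLA     25.00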

To filter out an entire category:

>>> ds.Symbol.mean(ds.Value, filter=ds.Symbol != 'MSFT')
*Symbol   Value
-------   -----
AAPL      10.00
MSFT        nan
TSLA      25.00

In this case, the filtered category is shown, but the result of the operation on its values is NaN.

If you want to make sure your filter is doing what you intend before you apply a function to the filtered data, you can call set_valid() on the Categorical.

Calling set_valid() on a Categorical returns a Categorical of the same length. Wherever the filter result is False, the category is set to ‘Filtered’ and the associated index value is 0. (Contrast this with filtered Datasets, where filter() returns a smaller Dataset containing only the rows where the filter condition is met.)

>>> ds.Symbol.set_valid(ds.Value > 20.0)
Categorical([Filtered, Filtered, Filtered, Filtered, MSFT, TSLA]) Length: 6
  FastArray([0, 0, 0, 0, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'MSFT', b'TSLA'], dtype='|S4') Unique count: 2

To spot-check more closely, put the filtered values in a Dataset:

>>> ds_test = rt.Dataset()
>>> ds_test.SymbolTest = ds.Symbol.set_valid(ds.Value > 20.0)
>>> ds_test.ValueTest = ds.Value
>>> ds_test
#   SymbolTest   ValueTest
-   ----------   ---------
0   Filtered             5
1   Filtered            10
2   Filtered            15
3   Filtered            20
4   MSFT                25
5   TSLA                30

The earlier advice about set_valid() applies to Categoricals, too: avoid using it to make unnecessary copies of large amounts of data.

Multi-Key Categoricals

Multi-key Categoricals let you create and operate on groupings based on two or more related categories.

An example is a symbol-month pair, which you could use to get the average value of a stock for each month in your data:

>>> ds_mk = rt.Dataset()
>>> N = 25
>>> ds_mk.Symbol = rt.FA(rng.choice(['AAPL', 'AMZN', 'MSFT'], N))
>>> ds_mk.Value = rt.FA(rng.random(N))
>>> ds_mk.Date = rt.Date.range('20210101', '20211231', step=15)
>>> ds_mk
 #   Symbol   Value         Date
--   ------   -----   ----------
 0   AAPL      0.59   2021-01-01
 1   MSFT      0.78   2021-01-16
 2   AAPL      0.80   2021-01-31
 3   AAPL      0.95   2021-02-15
 4   AMZN      0.25   2021-03-02
 5   MSFT      0.59   2021-03-17
 6   AMZN      0.10   2021-04-01
 7   MSFT      0.62   2021-04-16
 8   MSFT      0.17   2021-05-01
 9   AAPL      0.56   2021-05-16
10   MSFT      0.57   2021-05-31
11   AMZN      0.47   2021-06-15
12   AMZN      0.52   2021-06-30
13   AAPL      0.76   2021-07-15
14   AMZN      0.80   2021-07-30
15   MSFT      0.49   2021-08-14
16   AMZN      0.60   2021-08-29
17   AAPL      0.93   2021-09-13
18   AMZN      0.12   2021-09-28
19   MSFT      0.12   2021-10-13
20   MSFT      0.09   2021-10-28
21   AAPL      0.66   2021-11-12
22   MSFT      0.42   2021-11-27
23   MSFT      0.77   2021-12-12
24   AAPL      0.67   2021-12-27

We want to group the dates by month. An easy way to do this is by using start_of_month:

>>> ds_mk.Month = ds_mk.Date.start_of_month
>>> ds_mk
 #   Symbol   Value         Date        Month
--   ------   -----   ----------   ----------
 0   AAPL      0.59   2021-01-01   2021-01-01
 1   MSFT      0.78   2021-01-16   2021-01-01
 2   AAPL      0.80   2021-01-31   2021-01-01
 3   AAPL      0.95   2021-02-15   2021-02-01
 4   AMZN      0.25   2021-03-02   2021-03-01
 5   MSFT      0.59   2021-03-17   2021-03-01
 6   AMZN      0.10   2021-04-01   2021-04-01
 7   MSFT      0.62   2021-04-16   2021-04-01
 8   MSFT      0.17   2021-05-01   2021-05-01
 9   AAPL      0.56   2021-05-16   2021-05-01
10   MSFT      0.57   2021-05-31   2021-05-01
11   AMZN      0.47   2021-06-15   2021-06-01
12   AMZN      0.52   2021-06-30   2021-06-01
13   AAPL      0.76   2021-07-15   2021-07-01
14   AMZN      0.80   2021-07-30   2021-07-01
15   MSFT      0.49   2021-08-14   2021-08-01
16   AMZN      0.60   2021-08-29   2021-08-01
17   AAPL      0.93   2021-09-13   2021-09-01
18   AMZN      0.12   2021-09-28   2021-09-01
19   MSFT      0.12   2021-10-13   2021-10-01
20   MSFT      0.09   2021-10-28   2021-10-01
21   AAPL      0.66   2021-11-12   2021-11-01
22   MSFT      0.42   2021-11-27   2021-11-01
23   MSFT      0.77   2021-12-12   2021-12-01
24   AAPL      0.67   2021-12-27   2021-12-01

Now all Dates in January are associated with 2021-01-01, all Dates in February are associated with 2021-02-01, and so on. These first-of-month dates are our month groups.

We create a multi-key Categorical by passing rt.Cat() the Symbol and Month columns:

>>> ds_mk.Symbol_Month = rt.Cat([ds_mk.Symbol, ds_mk.Month])
>>> ds_mk.Symbol_Month
Categorical([(AAPL, 2021-01-01), (MSFT, 2021-01-01), (AAPL, 2021-01-01), (AAPL, 2021-02-01), (AMZN, 2021-03-01), ..., (MSFT, 2021-10-01), (AAPL, 2021-11-01), (MSFT, 2021-11-01), (MSFT, 2021-12-01), (AAPL, 2021-12-01)]) Length: 25
  FastArray([ 1,  2,  1,  3,  4, ..., 17, 18, 19, 20, 21], dtype=int8) Base Index: 1
  {'Symbol': FastArray([b'AAPL', b'MSFT', b'AAPL', b'AMZN', b'MSFT', ..., b'MSFT', b'AAPL', b'MSFT', b'MSFT', b'AAPL'], dtype='|S4'), 'Month': Date(['2021-01-01', '2021-01-01', '2021-02-01', '2021-03-01', '2021-03-01', ..., '2021-10-01', '2021-11-01', '2021-11-01', '2021-12-01', '2021-12-01'])} Unique count: 21

And now we can get the average value for each symbol-month pair:

>>> ds_mk.Symbol_Month.mean(ds_mk.Value)
*Symbol       *Month   Value
-------   ----------   -----
AAPL      2021-01-01    0.69
MSFT      2021-01-01    0.78
AAPL      2021-02-01    0.95
AMZN      2021-03-01    0.25
MSFT      2021-03-01    0.59
AMZN      2021-04-01    0.10
MSFT      2021-04-01    0.62
.         2021-05-01    0.37
AAPL      2021-05-01    0.56
AMZN      2021-06-01    0.49
AAPL      2021-07-01    0.76
AMZN      2021-07-01    0.80
MSFT      2021-08-01    0.49
AMZN      2021-08-01    0.60
AAPL      2021-09-01    0.93
AMZN      2021-09-01    0.12
MSFT      2021-10-01    0.10
AAPL      2021-11-01    0.66
MSFT      2021-11-01    0.42
.         2021-12-01    0.77
AAPL      2021-12-01    0.67

The aggregated results are presented with the two group keys arranged hierarchically. The dot indicates that the category above is repeated.

All the functions supported by Categoricals can also be used for multi-key Categoricals.

You can also filter multi-key Categoricals by calling set_valid() on the Categorical, and operate on filtered data by passing the filter keyword argument to the function you use.
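
For example, here’s a sketch that computes the per-pair means using only the second half of the year (we’re assuming that comparing the Month column to a one-element rt.Date broadcasts like other array comparisons):

>>> second_half = ds_mk.Month >= rt.Date(['20210701'])
>>> h2_means = ds_mk.Symbol_Month.mean(ds_mk.Value, filter=second_half)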

Later on we’ll cover Accum2(), another Riptable tool that does similar two-key aggregations but provides summary data and a styled output.

Partition Numeric Data into Bins for Analysis

When you have a large array of numeric data, rt.cut() and rt.qcut() can help you partition the values into Categorical bins for analysis.

Use cut() to create equal-width bins or bins defined by specified endpoints. Use qcut() to create bins based on sample quantiles.

Let’s use a moderately large Dataset:

>>> N = 1_000
>>> ds2 = rt.Dataset()
>>> ds2.Symbol = rt.FA(rng.choice(['AAPL', 'AMZN', 'MSFT'], N))
>>> base_price = 100 + rt.FA(np.linspace(0, 900, N))
>>> noise = rt.FA(rng.normal(0, 50, N))
>>> ds2.Price = base_price + noise
>>> ds2
  #   Symbol      Price
---   ------   --------
  0   AMZN        93.87
  1   AMZN       150.69
  2   AAPL       154.76
  3   MSFT       153.99
  4   AMZN       105.55
  5   AMZN        62.25
  6   MSFT        51.22
  7   AMZN       123.54
  8   AAPL       126.17
  9   AAPL       172.47
 10   AAPL       164.01
 11   MSFT       103.30
 12   AAPL        48.60
 13   AAPL        95.76
 14   AMZN       123.47
...   ...           ...
985   AMZN     1,027.85
986   AAPL       993.06
987   AMZN       867.37
988   AAPL       940.92
989   AAPL     1,025.38
990   MSFT     1,052.54
991   AAPL     1,048.25
992   AMZN       914.09
993   AMZN     1,009.67
994   AAPL     1,046.27
995   AAPL       913.48
996   AMZN       996.90
997   AMZN     1,011.89
998   MSFT       984.06
999   MSFT       907.39

Create equal-width bins with rt.cut()

To partition values into equal-width bins, use cut() and specify the number of bins:

>>> ds2.PriceBin = rt.cut(ds2.Price, bins=5)
>>> ds2
  #   Symbol      Price   PriceBin
---   ------   --------   -----------------
  0   AMZN        93.87   -3.011->221.182
  1   AMZN       150.69   -3.011->221.182
  2   AAPL       154.76   -3.011->221.182
  3   MSFT       153.99   -3.011->221.182
  4   AMZN       105.55   -3.011->221.182
  5   AMZN        62.25   -3.011->221.182
  6   MSFT        51.22   -3.011->221.182
  7   AMZN       123.54   -3.011->221.182
  8   AAPL       126.17   -3.011->221.182
  9   AAPL       172.47   -3.011->221.182
 10   AAPL       164.01   -3.011->221.182
 11   MSFT       103.30   -3.011->221.182
 12   AAPL        48.60   -3.011->221.182
 13   AAPL        95.76   -3.011->221.182
 14   AMZN       123.47   -3.011->221.182
...   ...           ...   ...
985   AMZN     1,027.85   893.763->1117.956
986   AAPL       993.06   893.763->1117.956
987   AMZN       867.37   669.569->893.763
988   AAPL       940.92   893.763->1117.956
989   AAPL     1,025.38   893.763->1117.956
990   MSFT     1,052.54   893.763->1117.956
991   AAPL     1,048.25   893.763->1117.956
992   AMZN       914.09   893.763->1117.956
993   AMZN     1,009.67   893.763->1117.956
994   AAPL     1,046.27   893.763->1117.956
995   AAPL       913.48   893.763->1117.956
996   AMZN       996.90   893.763->1117.956
997   AMZN     1,011.89   893.763->1117.956
998   MSFT       984.06   893.763->1117.956
999   MSFT       907.39   893.763->1117.956

Notice that the bins form the categories of a Categorical:

>>> ds2.PriceBin
Categorical([-3.011->221.182, -3.011->221.182, -3.011->221.182, -3.011->221.182, -3.011->221.182, ..., 893.763->1117.956, 893.763->1117.956, 893.763->1117.956, 893.763->1117.956, 893.763->1117.956]) Length: 1000
  FastArray([1, 1, 1, 1, 1, ..., 5, 5, 5, 5, 5], dtype=int8) Base Index: 1
  FastArray([b'-3.011->221.182', b'221.182->445.376', b'445.376->669.569', b'669.569->893.763', b'893.763->1117.956'], dtype='|S17') Unique count: 5
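
And because PriceBin is an ordinary Categorical, everything covered above applies to it. For example, you can compute the mean price per bin (we assign the result rather than display it, since the exact values depend on the random data):

>>> mean_by_bin = ds2.PriceBin.mean(ds2.Price)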

To specify your own bin endpoints, provide an array. Here, we define two bins: one for prices from 0 to 600 (with both endpoints, 0 and 600, included), and one for prices from 600 to 1,200 (600 excluded, 1,200 included):

>>> bins = [0, 600, 1200]
>>> ds2.PriceBin2 = rt.cut(ds2.Price, bins)
>>> ds2
  #   Symbol      Price   PriceBin            PriceBin2
---   ------   --------   -----------------   -------------
  0   AMZN        93.87   -3.011->221.182     0.0->600.0
  1   AMZN       150.69   -3.011->221.182     0.0->600.0
  2   AAPL       154.76   -3.011->221.182     0.0->600.0
  3   MSFT       153.99   -3.011->221.182     0.0->600.0
  4   AMZN       105.55   -3.011->221.182     0.0->600.0
  5   AMZN        62.25   -3.011->221.182     0.0->600.0
  6   MSFT        51.22   -3.011->221.182     0.0->600.0
  7   AMZN       123.54   -3.011->221.182     0.0->600.0
  8   AAPL       126.17   -3.011->221.182     0.0->600.0
  9   AAPL       172.47   -3.011->221.182     0.0->600.0
 10   AAPL       164.01   -3.011->221.182     0.0->600.0
 11   MSFT       103.30   -3.011->221.182     0.0->600.0
 12   AAPL        48.60   -3.011->221.182     0.0->600.0
 13   AAPL        95.76   -3.011->221.182     0.0->600.0
 14   AMZN       123.47   -3.011->221.182     0.0->600.0
...   ...           ...   ...                 ...
985   AMZN     1,027.85   893.763->1117.956   600.0->1200.0
986   AAPL       993.06   893.763->1117.956   600.0->1200.0
987   AMZN       867.37   669.569->893.763    600.0->1200.0
988   AAPL       940.92   893.763->1117.956   600.0->1200.0
989   AAPL     1,025.38   893.763->1117.956   600.0->1200.0
990   MSFT     1,052.54   893.763->1117.956   600.0->1200.0
991   AAPL     1,048.25   893.763->1117.956   600.0->1200.0
992   AMZN       914.09   893.763->1117.956   600.0->1200.0
993   AMZN     1,009.67   893.763->1117.956   600.0->1200.0
994   AAPL     1,046.27   893.763->1117.956   600.0->1200.0
995   AAPL       913.48   893.763->1117.956   600.0->1200.0
996   AMZN       996.90   893.763->1117.956   600.0->1200.0
997   AMZN     1,011.89   893.763->1117.956   600.0->1200.0
998   MSFT       984.06   893.763->1117.956   600.0->1200.0
999   MSFT       907.39   893.763->1117.956   600.0->1200.0
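
If you’d rather name the bins yourself, cut() also takes a labels argument, as qcut() does below (the label names here are made up for illustration):

>>> ds2.PriceBin3 = rt.cut(ds2.Price, bins, labels=['Under600', 'Over600'])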

Create bins based on sample quantiles with rt.qcut()

To partition values into bins based on sample quantiles, use qcut().

We’ll create another Dataset with symbol groups and contracts per day:

>>> N = 1_000
>>> ds3 = rt.Dataset()
>>> ds3.SymbolGroup = rt.FA(rng.choice(['spx', 'eqt_comp', 'eqt300', 'eqtrest'], N))
>>> ds3.ContractsPerDay = rng.integers(low=0, high=5_000, size=N)
>>> ds3.head()
 #   SymbolGroup   ContractsPerDay
--   -----------   ---------------
 0   eqt300                  1,624
 1   spx                       851
 2   spx                     3,487
 3   eqt300                    345
 4   eqtrest                 2,584
 5   spx                     3,639
 6   spx                     4,741
 7   eqtrest                 1,440
 8   eqtrest                    39
 9   spx                     3,618
10   eqt_comp                    7
11   eqt300                    331
12   spx                     4,952
13   eqt_comp                4,312
14   eqt_comp                3,537
15   eqt300                  4,177
16   eqt_comp                  376
17   eqt_comp                  444
18   eqt_comp                1,504
19   eqtrest                   118

Create three labeled, quantile-based bins for the volume:

>>> label_names = ['Low', 'Medium', 'High']
>>> ds3.Volume = rt.qcut(ds3.ContractsPerDay, q=3, labels=label_names)
>>> ds3.head()
 #   SymbolGroup   ContractsPerDay   Volume
--   -----------   ---------------   ------
 0   eqt300                  1,624   Low
 1   spx                       851   Low
 2   spx                     3,487   High
 3   eqt300                    345   Low
 4   eqtrest                 2,584   Medium
 5   spx                     3,639   High
 6   spx                     4,741   High
 7   eqtrest                 1,440   Low
 8   eqtrest                    39   Low
 9   spx                     3,618   High
10   eqt_comp                    7   Low
11   eqt300                    331   Low
12   spx                     4,952   High
13   eqt_comp                4,312   High
14   eqt_comp                3,537   High
15   eqt300                  4,177   High
16   eqt_comp                  376   Low
17   eqt_comp                  444   Low
18   eqt_comp                1,504   Low
19   eqtrest                   118   Low

As with cut(), the bins form the categories of a Categorical:

>>> ds3.Volume
Categorical([Low, Low, High, Low, Medium, ..., Low, Medium, High, Low, Low]) Length: 1000
  FastArray([2, 2, 4, 2, 3, ..., 2, 3, 4, 2, 2], dtype=int8) Base Index: 1
  FastArray([b'Clipped', b'Low', b'Medium', b'High'], dtype='|S7') Unique count: 4

The ‘Clipped’ bin is created to hold any out-of-bounds values (though there are none in this case).

Alternatively, you can give qcut() a list of quantiles (numbers between 0 and 1, inclusive). Here, we create quartiles:

>>> quartiles = [0, .25, .5, .75, 1.]
>>> ds3.VolQuartiles = rt.qcut(ds3.ContractsPerDay, q=quartiles)
>>> ds3.head()
 #   SymbolGroup   ContractsPerDay   Volume   VolQuartiles
--   -----------   ---------------   ------   ---------------
 0   eqt300                  1,624   Low      1273.75->2601.0
 1   spx                       851   Low      0.0->1273.75
 2   spx                     3,487   High     2601.0->3793.0
 3   eqt300                    345   Low      0.0->1273.75
 4   eqtrest                 2,584   Medium   1273.75->2601.0
 5   spx                     3,639   High     2601.0->3793.0
 6   spx                     4,741   High     3793.0->4991.0
 7   eqtrest                 1,440   Low      1273.75->2601.0
 8   eqtrest                    39   Low      0.0->1273.75
 9   spx                     3,618   High     2601.0->3793.0
10   eqt_comp                    7   Low      0.0->1273.75
11   eqt300                    331   Low      0.0->1273.75
12   spx                     4,952   High     3793.0->4991.0
13   eqt_comp                4,312   High     3793.0->4991.0
14   eqt_comp                3,537   High     2601.0->3793.0
15   eqt300                  4,177   High     3793.0->4991.0
16   eqt_comp                  376   Low      0.0->1273.75
17   eqt_comp                  444   Low      0.0->1273.75
18   eqt_comp                1,504   Low      1273.75->2601.0
19   eqtrest                   118   Low      0.0->1273.75

Per-Group Calculations with Other Functions

Categoricals support most common functions. For functions that aren’t supported (for example, a function you’ve written), you can use apply_reduce() to apply a reducing function and apply_nonreduce() to apply a non-reducing function.

apply_reduce()

The function you use with apply_reduce() can take in one or multiple columns/FastArrays as input, but it must return a single value per group.

To illustrate, we’ll use apply_reduce() with two simple lambda functions that each return one value. (A lambda function is an anonymous function defined by a single expression whose result is its return value. When a function takes another function as an argument, passing a lambda can be simpler and clearer than defining a named function separately.)

First, we’ll create a new Dataset:

>>> N = 50
>>> ds = rt.Dataset()
>>> ds.Symbol = rt.Cat(rng.choice(['AAPL', 'AMZN', 'TSLA', 'SPY', 'GME'], N))
>>> ds.Value = rng.random(N) * 100
>>> ds.Value2 = ds.Value * 2
>>> ds.sample()
#   Symbol   Value   Value2
-   ------   -----   ------
0   SPY      41.04    82.09
1   TSLA     93.07   186.14
2   AMZN      2.03     4.05
3   AAPL     16.19    32.37
4   AMZN      2.42     4.85
5   TSLA     98.13   196.26
6   SPY      98.67   197.34
7   SPY      62.31   124.61
8   TSLA     96.79   193.58
9   TSLA     67.35   134.70

The first lambda function takes one column as input:

>>> # ds.Value becomes the 'x' in our lambda function.
>>> ds.Symbol.apply_reduce(lambda x: x.min() + 2, ds.Value)
*Symbol   Value
-------   -----
AAPL      11.36
AMZN       4.03
GME       16.65
SPY        7.76
TSLA       2.10

Our second lambda function takes two columns as input:

>>> ds.Symbol.apply_reduce(lambda x, y: x.sum() * y.mean(), (ds.Value, ds.Value2))
*Symbol       Value
-------   ---------
AAPL      26,904.13
AMZN      39,400.64
GME       26,857.53
SPY       32,560.75
TSLA      74,124.69

Also note that in this example, the column name shown in the output comes from the first column listed in the tuple.

If you like, you can use transform=True to expand the results and assign them to a column:

>>> ds.MyCalc1 = ds.Symbol.apply_reduce(lambda x: x.min() + 2, ds.Value, transform=True)
>>> ds.MyCalc2 = ds.Symbol.apply_reduce(lambda x, y: x.sum() * y.mean(),
...                                    (ds.Value, ds.Value2), transform=True)
>>> ds
  #   Symbol   Value   Value2   MyCalc1     MyCalc2
---   ------   -----   ------   -------   ---------
  0   AAPL     12.39    24.77     11.36   26,904.13
  1   SPY      41.04    82.09      7.76   32,560.75
  2   AMZN     55.69   111.39      4.03   39,400.64
  3   TSLA     93.07   186.14      2.10   74,124.69
  4   TSLA      3.62     7.24      2.10   74,124.69
  5   TSLA     62.15   124.29      2.10   74,124.69
  6   SPY      45.77    91.55      7.76   32,560.75
  7   AMZN      2.03     4.05      4.03   39,400.64
  8   SPY      24.95    49.91      7.76   32,560.75
  9   AMZN     11.85    23.70      4.03   39,400.64
 10   AMZN     21.68    43.36      4.03   39,400.64
 11   TSLA     27.46    54.91      2.10   74,124.69
 12   GME      40.13    80.26     16.65   26,857.53
 13   AMZN     52.90   105.81      4.03   39,400.64
 14   TSLA      0.10     0.20      2.10   74,124.69
...   ...        ...      ...       ...         ...
 35   TSLA     38.40    76.79      2.10   74,124.69
 36   AAPL     93.12   186.25     11.36   26,904.13
 37   SPY      14.92    29.85      7.76   32,560.75
 38   AAPL     99.71   199.41     11.36   26,904.13
 39   TSLA     37.91    75.83      2.10   74,124.69
 40   GME      64.88   129.75     16.65   26,857.53
 41   TSLA     96.79   193.58      2.10   74,124.69
 42   SPY       5.76    11.52      7.76   32,560.75
 43   TSLA     92.29   184.57      2.10   74,124.69
 44   AMZN     56.78   113.56      4.03   39,400.64
 45   AMZN     70.44   140.88      4.03   39,400.64
 46   TSLA     14.92    29.84      2.10   74,124.69
 47   AAPL     53.34   106.68     11.36   26,904.13
 48   TSLA     67.35   134.70      2.10   74,124.69
 49   TSLA     45.62    91.25      2.10   74,124.69

As expected, every instance of a category gets the same value.

apply_nonreduce()

For apply_nonreduce(), our lambda function computes a new value for every element of the original input:

>>> ds.MyCalc3 = ds.Symbol.apply_nonreduce(lambda x: x.cumsum() + 2, ds.Value)
>>> ds
  #   Symbol   Value   Value2   MyCalc1     MyCalc2   MyCalc3
---   ------   -----   ------   -------   ---------   -------
  0   AAPL     12.39    24.77     11.36   26,904.13     14.39
  1   SPY      41.04    82.09      7.76   32,560.75     43.04
  2   AMZN     55.69   111.39      4.03   39,400.64     57.69
  3   TSLA     93.07   186.14      2.10   74,124.69     95.07
  4   TSLA      3.62     7.24      2.10   74,124.69     98.69
  5   TSLA     62.15   124.29      2.10   74,124.69    160.84
  6   SPY      45.77    91.55      7.76   32,560.75     88.82
  7   AMZN      2.03     4.05      4.03   39,400.64     59.72
  8   SPY      24.95    49.91      7.76   32,560.75    113.77
  9   AMZN     11.85    23.70      4.03   39,400.64     71.57
 10   AMZN     21.68    43.36      4.03   39,400.64     93.25
 11   TSLA     27.46    54.91      2.10   74,124.69    188.30
 12   GME      40.13    80.26     16.65   26,857.53     42.13
 13   AMZN     52.90   105.81      4.03   39,400.64    146.15
 14   TSLA      0.10     0.20      2.10   74,124.69    188.40
...   ...        ...      ...       ...         ...       ...
 35   TSLA     38.40    76.79      2.10   74,124.69    417.18
 36   AAPL     93.12   186.25     11.36   26,904.13    133.05
 37   SPY      14.92    29.85      7.76   32,560.75    399.73
 38   AAPL     99.71   199.41     11.36   26,904.13    232.76
 39   TSLA     37.91    75.83      2.10   74,124.69    455.09
 40   GME      64.88   129.75     16.65   26,857.53    261.12
 41   TSLA     96.79   193.58      2.10   74,124.69    551.88
 42   SPY       5.76    11.52      7.76   32,560.75    405.49
 43   TSLA     92.29   184.57      2.10   74,124.69    644.17
 44   AMZN     56.78   113.56      4.03   39,400.64    437.63
 45   AMZN     70.44   140.88      4.03   39,400.64    508.07
 46   TSLA     14.92    29.84      2.10   74,124.69    659.09
 47   AAPL     53.34   106.68     11.36   26,904.13    286.10
 48   TSLA     67.35   134.70      2.10   74,124.69    726.44
 49   TSLA     45.62    91.25      2.10   74,124.69    772.06

Like apply_reduce(), apply_nonreduce() can take one or multiple columns as input:

>>> ds.MyCalc4 = ds.Symbol.apply_nonreduce(lambda x, y: x.cumsum() + y, (ds.Value, ds.Value2))
>>> ds
  #   Symbol   Value   Value2   MyCalc1     MyCalc2   MyCalc3   MyCalc4
---   ------   -----   ------   -------   ---------   -------   -------
  0   AAPL     12.39    24.77     11.36   26,904.13     14.39     37.16
  1   SPY      41.04    82.09      7.76   32,560.75     43.04    123.13
  2   AMZN     55.69   111.39      4.03   39,400.64     57.69    167.08
  3   TSLA     93.07   186.14      2.10   74,124.69     95.07    279.21
  4   TSLA      3.62     7.24      2.10   74,124.69     98.69    103.94
  5   TSLA     62.15   124.29      2.10   74,124.69    160.84    283.13
  6   SPY      45.77    91.55      7.76   32,560.75     88.82    178.36
  7   AMZN      2.03     4.05      4.03   39,400.64     59.72     61.77
  8   SPY      24.95    49.91      7.76   32,560.75    113.77    161.68
  9   AMZN     11.85    23.70      4.03   39,400.64     71.57     93.27
 10   AMZN     21.68    43.36      4.03   39,400.64     93.25    134.61
 11   TSLA     27.46    54.91      2.10   74,124.69    188.30    241.21
 12   GME      40.13    80.26     16.65   26,857.53     42.13    120.39
 13   AMZN     52.90   105.81      4.03   39,400.64    146.15    249.96
 14   TSLA      0.10     0.20      2.10   74,124.69    188.40    186.59
...   ...        ...      ...       ...         ...       ...       ...
 35   TSLA     38.40    76.79      2.10   74,124.69    417.18    491.97
 36   AAPL     93.12   186.25     11.36   26,904.13    133.05    317.30
 37   SPY      14.92    29.85      7.76   32,560.75    399.73    427.57
 38   AAPL     99.71   199.41     11.36   26,904.13    232.76    430.17
 39   TSLA     37.91    75.83      2.10   74,124.69    455.09    528.92
 40   GME      64.88   129.75     16.65   26,857.53    261.12    388.88
 41   TSLA     96.79   193.58      2.10   74,124.69    551.88    743.46
 42   SPY       5.76    11.52      7.76   32,560.75    405.49    415.01
 43   TSLA     92.29   184.57      2.10   74,124.69    644.17    826.74
 44   AMZN     56.78   113.56      4.03   39,400.64    437.63    549.19
 45   AMZN     70.44   140.88      4.03   39,400.64    508.07    646.95
 46   TSLA     14.92    29.84      2.10   74,124.69    659.09    686.92
 47   AAPL     53.34   106.68     11.36   26,904.13    286.10    390.78
 48   TSLA     67.35   134.70      2.10   74,124.69    726.44    859.14
 49   TSLA     45.62    91.25      2.10   74,124.69    772.06    861.31

apply()

If you want your custom function to return multiple aggregations – for example, you want to return both the mean value of one column and the minimum value of another column – use apply().

Warning: Because apply() isn’t a vectorized operation, it can be slow and use a lot of memory if you’re using it on large amounts of data. Try to avoid it if you can.

To be used with apply(), your function must be able to take in a Dataset. It can return a Dataset, a single array, or a dictionary of column names and values.

Here’s a function that performs two reducing operations and returns a Dataset:

>>> def my_apply_func(ds):
...     new_ds = rt.Dataset({
...         'Mean_Value': ds.Value.mean(),
...         'Min_Value': ds.Value.min()
...     })
...     return new_ds

Here it is applied:

>>> ds.Symbol.apply(my_apply_func, ds)
*Symbol   Mean_Value   Min_Value
-------   ----------   ---------
AAPL           47.35        9.36
AMZN           38.93        2.03
GME            51.82       14.65
SPY            40.35        5.76
TSLA           48.13        0.10
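
Given the warning above, note that a result like this can often be assembled from vectorized calls instead. Here’s a sketch; it relies on both calls returning the groups in the same order, which they do because they share the same Categorical:

>>> result = ds.Symbol.mean(ds.Value)                  # Per-group means.
>>> result.Min_Value = ds.Symbol.min(ds.Value).Value   # Same group order.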

Our second function performs two non-reducing operations:

>>> def my_apply_func2(ds):
...     new_ds = rt.Dataset({
...         'Val1': ds.Value * 3,
...         'Val2': ds.Value * 4
...     })
...     return new_ds
>>> ds.Symbol.apply(my_apply_func2, ds)
*gb_key_0     Val1     Val2
---------   ------   ------
AAPL         37.16    49.54
SPY         123.13   164.18
AMZN        167.08   222.77
TSLA        279.21   372.28
TSLA         10.87    14.49
TSLA        186.44   248.58
SPY         137.32   183.09
AMZN          6.08     8.10
SPY          74.86    99.82
AMZN         35.55    47.39
AMZN         65.04    86.72
TSLA         82.37   109.83
GME         120.39   160.52
AMZN        158.71   211.62
TSLA          0.30     0.40
...            ...      ...
TSLA        115.19   153.58
AAPL        279.37   372.50
SPY          44.77    59.69
AAPL        299.12   398.83
TSLA        113.74   151.65
GME         194.63   259.51
TSLA        290.37   387.16
SPY          17.28    23.04
TSLA        276.86   369.15
AMZN        170.34   227.12
AMZN        211.32   281.76
TSLA         44.76    59.67
AAPL        160.02   213.35
TSLA        202.05   269.41
TSLA        136.87   182.50

Because the operations in this function are non-reducing operations, the resulting Dataset is expanded.
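
As noted above, the function passed to apply() can also return a dictionary of column names and values. Here’s a minimal sketch (Range_Value is a made-up column name):

>>> def my_apply_func3(ds):
...     return {'Range_Value': ds.Value.max() - ds.Value.min()}
>>> ranges = ds.Symbol.apply(my_apply_func3, ds)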

Note that until a reported bug is fixed, column names might not persist through grouping operations.

For more in-depth information about Categoricals, see the Categoricals User Guide.

In the next section, Accums, we look at another way to do multi-key groupings with fancier output.


Questions or comments about this guide? Email RiptableDocumentation@sig.com.