Accums
Accums aggregate data similarly to Categoricals, but they distinguish themselves by providing a fancier output with overall aggregates in summary footers and columns.
Accum2()
Accum2()
is very much like a multi-key Categorical: It computes
aggregates values for pairs of groups. The difference is in the output –
an Accum2()
result looks more like a pivot table, with the first
group passed to the function providing row labels and the second
providing the column labels.
The function is also applied to each row and column, with results shown in a summary column and row, as well as to all columns and rows combined (with the result shown in the bottom right corner).
We’ll use a Dataset that’s similar to the one we used for multi-key Categoricals, so we can compare the output:
>>> rng = np.random.default_rng(seed=42)
>>> ds = rt.Dataset()
>>> N = 100
>>> ds.Symbol = rt.FA(rng.choice(['AAPL', 'AMZN', 'MSFT'], N))
>>> ds.Value = rt.FA(rng.random(N))
>>> ds.Date = rt.Date.range('20210101', days = 100) # Dates from January to mid-April
>>> ds.Month = ds.Date.start_of_month
>>> # Accum2 can take Categoricals or FastArrays as input.
>>> # To use this ds for accum_ratio, we need Symbol and Month to be Categoricals.
>>> ds.Symbol = rt.Cat(ds.Symbol)
>>> ds.Month = rt.Cat(ds.Month)
>>> ds
# Symbol Value Date Month
--- ------ ----- ---------- ----------
0 AAPL 0.20 2021-01-01 2021-01-01
1 MSFT 0.01 2021-01-02 2021-01-01
2 AMZN 0.79 2021-01-03 2021-01-01
3 AMZN 0.66 2021-01-04 2021-01-01
4 AMZN 0.71 2021-01-05 2021-01-01
5 MSFT 0.78 2021-01-06 2021-01-01
6 AAPL 0.46 2021-01-07 2021-01-01
7 MSFT 0.57 2021-01-08 2021-01-01
8 AAPL 0.14 2021-01-09 2021-01-01
9 AAPL 0.11 2021-01-10 2021-01-01
10 AMZN 0.67 2021-01-11 2021-01-01
11 MSFT 0.47 2021-01-12 2021-01-01
12 MSFT 0.57 2021-01-13 2021-01-01
13 MSFT 0.76 2021-01-14 2021-01-01
14 MSFT 0.63 2021-01-15 2021-01-01
... ... ... ... ...
85 MSFT 0.02 2021-03-27 2021-03-01
86 AAPL 0.96 2021-03-28 2021-03-01
87 AAPL 0.48 2021-03-29 2021-03-01
88 MSFT 0.78 2021-03-30 2021-03-01
89 MSFT 0.08 2021-03-31 2021-03-01
90 AMZN 0.49 2021-04-01 2021-04-01
91 MSFT 0.49 2021-04-02 2021-04-01
92 MSFT 0.94 2021-04-03 2021-04-01
93 AMZN 0.57 2021-04-04 2021-04-01
94 MSFT 0.47 2021-04-05 2021-04-01
95 AAPL 0.27 2021-04-06 2021-04-01
96 AAPL 0.33 2021-04-07 2021-04-01
97 MSFT 0.52 2021-04-08 2021-04-01
98 AMZN 0.44 2021-04-09 2021-04-01
99 AAPL 0.02 2021-04-10 2021-04-01
Here’s the Accum2()
table before we apply an aggregation function.
You can see how many values fall into each group pair:
>>> rt.Accum2(ds.Symbol, ds.Month)
Accum2 Keys
X:Date(['2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01'])
Y:FastArray([b'AAPL', b'AMZN', b'MSFT'], dtype='|S4')
Bins:20 Rows:100
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Sum
------- ---------- ---------- ---------- ---------- ---
AAPL 6 9 9 3 27
AMZN 13 8 9 3 33
MSFT 12 11 13 4 40
If we aggregate with count()
, it has the same data, but we see the
output formatting:
>>> rt.Accum2(ds.Symbol, ds.Month).count()
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Sum
------- ---------- ---------- ---------- ---------- ---
AAPL 6 9 9 3 27
AMZN 13 8 9 3 33
MSFT 12 11 13 4 40
Sum 31 28 31 10 100
The bottom row and rightmost column provide summary data.
Now we’ll get the average value per symbol-month pair:
>>> rt.Accum2(ds.Symbol, ds.Month).mean(ds.Value)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Mean
------- ---------- ---------- ---------- ---------- ----
AAPL 0.35 0.40 0.54 0.21 0.41
AMZN 0.54 0.48 0.45 0.50 0.50
MSFT 0.44 0.47 0.42 0.61 0.46
Mean 0.47 0.45 0.46 0.45 0.46
Note that the summary row and column show the mean values for all the input values for each group, not just the means of the displayed group means.
To illustrate: Here’s the mean of the displayed group mean values for AAPL:
>>> (0.35 + 0.40 + 0.54 + 0.21) / 4
0.375
And here’s the mean of all AAPL values:
>>> ds.Value.nanmean(filter=ds.Symbol == 'AAPL')
0.41317486824408933
For comparison, here’s the multi-key Categorical version:
>>> ds.Symbol_Month = rt.Cat([ds.Symbol, ds.Month])
>>> ds.Symbol_Month.mean(ds.Value)
*Symbol *Month Value
------- ---------- -----
AAPL 2021-01-01 0.35
MSFT 2021-01-01 0.44
AMZN 2021-01-01 0.54
AAPL 2021-02-01 0.40
AMZN 2021-02-01 0.48
MSFT 2021-02-01 0.47
. 2021-03-01 0.42
AMZN 2021-03-01 0.45
AAPL 2021-03-01 0.54
AMZN 2021-04-01 0.50
MSFT 2021-04-01 0.61
AAPL 2021-04-01 0.21
You can pass a filter keyword argument to the function you call on
Accum2()
:
>>> rt.Accum2(ds.Symbol, ds.Month).mean(ds.Value, filter=ds.Value > 0.5)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Mean
------- ---------- ---------- ---------- ---------- ----
AAPL 0.85 0.74 0.76 nan 0.77
AMZN 0.67 0.75 0.72 0.57 0.69
MSFT 0.65 0.78 0.70 0.73 0.71
Mean 0.67 0.76 0.72 0.68 0.71
accum_ratio()
For each group pair, accum_ratio()
computes a ratio of values you
specify. The results are presented in an Accum table.
For our example we’ll add PnL and Size (number of sales) columns, and
we’ll use accum_ratio()
to get the PnL for each symbol-month bucket,
weighted by size:
>>> ds.PnL = rng.normal(10, 20, 100)
>>> ds.Size = rng.random(100) * 100
Like Accum2()
, accum_ratio()
takes two Categoricals (a row
Categorical and a column Categorical). You also specify the numerator
values and denominator values. For each group pair, it sums the
numerator values and denominator values and presents the ratios in a
table:
>>> rt.accum_ratio(ds.Symbol, ds.Month, ds.PnL * ds.Size, ds.Size, include_numer=True)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Ratio Numer Denom
------- ---------- ---------- ---------- ---------- ----- --------- --------
AAPL 3.13 11.93 1.95 28.86 8.81 12,363.71 1,404.13
AMZN 5.54 2.36 23.34 -2.94 10.01 16,971.55 1,695.67
MSFT 23.90 22.78 -1.40 -9.61 10.35 17,501.11 1,690.46
Ratio 10.13 13.17 7.31 8.25 9.78
Numer 10,604.13 18,953.08 13,471.17 3,807.98 46,836.36
Denom 1,047.18 1,438.84 1,842.65 461.59 4,790.26
The result is the ratio of the following two tables.
Numerator:
>>> rt.Accum2(ds.Symbol, ds.Month).nansum(ds.Size * ds.PnL)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Nansum
------- ---------- ---------- ---------- ---------- ---------
AAPL 699.07 5,075.98 1,100.76 5,487.90 12,363.71
AMZN 2,956.74 1,065.03 13,358.59 -408.81 16,971.55
MSFT 6,948.32 12,812.08 -988.18 -1,271.11 17,501.11
Nansum 10,604.13 18,953.08 13,471.17 3,807.98 46,836.36
Denominator:
>>> rt.Accum2(ds.Symbol, ds.Month).nansum(ds.Size)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Nansum
------- ---------- ---------- ---------- ---------- --------
AAPL 223.12 425.49 565.38 190.13 1,404.13
AMZN 533.28 450.83 572.34 139.22 1,695.67
MSFT 290.78 562.52 704.92 132.24 1,690.46
Nansum 1,047.18 1,438.84 1,842.65 461.59 4,790.26
When the numerator and denominator are the same, the result is as you might expect:
>>> rt.accum_ratio(ds.Symbol, ds.Month, ds.Size, ds.Size, include_numer=True)
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 Ratio Numer Denom
------- ---------- ---------- ---------- ---------- ----- -------- --------
AAPL 1.00 1.00 1.00 1.00 1.00 1,404.13 1,404.13
AMZN 1.00 1.00 1.00 1.00 1.00 1,695.67 1,695.67
MSFT 1.00 1.00 1.00 1.00 1.00 1,690.46 1,690.46
Ratio 1.00 1.00 1.00 1.00 1.00
Numer 1,047.18 1,438.84 1,842.65 461.59 4,790.26
Denom 1,047.18 1,438.84 1,842.65 461.59 4,790.26
accum_ratiop()
accum_ratiop()
takes one column of values as numerators and computes
an internal ratio for each group pair, where the denominator is one of
three sums:
The row sum (
norm_by='R'
)The column sum (
norm_by='C'
)The total sum (
norm_by='T'
)
For example, this table shows that 30.30% of AAPL sales were in February:
>>> rt.accum_ratiop(ds.Symbol, ds.Month, ds.Size, norm_by='R')
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 TotalRatio Total
---------- ---------- ---------- ---------- ---------- ---------- --------
AAPL 15.89 30.30 40.27 13.54 100.00 1,404.13
AMZN 31.45 26.59 33.75 8.21 100.00 1,695.67
MSFT 17.20 33.28 41.70 7.82 100.00 1,690.46
TotalRatio 21.86 30.04 38.47 9.64 100.00
Total 1,047.18 1,438.84 1,842.65 461.59 4,790.26
Note that the percentages in each row sum to 100%.
We can check the math by computing the ratio of AAPL’s February sales to AAPL’s total sales:
>>> filt_feb_aapl = (ds.Symbol == 'AAPL') & (ds.Month.as_string_array == rt.Date('20210201'))
>>> filt_total_aapl = ds.Symbol == 'AAPL'
>>> ds.Size[filt_feb_aapl].nansum() / ds.Size[filt_total_aapl].nansum()
0.3030291108538412
This table shows that AAPL’s sales are 29.57% of February sales:
>>> rt.accum_ratiop(ds.Symbol, ds.Month, ds.Size, norm_by='C')
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 TotalRatio Total
---------- ---------- ---------- ---------- ---------- ---------- --------
AAPL 21.31 29.57 30.68 41.19 29.31 1,404.13
AMZN 50.93 31.33 31.06 30.16 35.40 1,695.67
MSFT 27.77 39.10 38.26 28.65 35.29 1,690.46
TotalRatio 100.00 100.00 100.00 100.00 100.00
Total 1,047.18 1,438.84 1,842.65 461.59 4,790.26
Note that the percentages in each column sum to 100%.
Check the math:
>>> filt_feb_total = ds.Month.as_string_array == rt.Date('20210201')
>>> ds.Size[filt_feb_aapl].nansum() / ds.Size[filt_feb_total].nansum()
0.29571866540362846
This table shows that AAPL’s February sales represent 8.88% of all sales:
>>> rt.accum_ratiop(ds.Symbol, ds.Month, ds.Size, norm_by='T')
*Symbol 2021-01-01 2021-02-01 2021-03-01 2021-04-01 TotalRatio Total
---------- ---------- ---------- ---------- ---------- ---------- --------
AAPL 4.66 8.88 11.80 3.97 29.31 1,404.13
AMZN 11.13 9.41 11.95 2.91 35.40 1,695.67
MSFT 6.07 11.74 14.72 2.76 35.29 1,690.46
TotalRatio 21.86 30.04 38.47 9.64 100.00
Total 1,047.18 1,438.84 1,842.65 461.59 4,790.26
Note that the “TotalRatio” row and column percentages each sum to 100%.
Check the math:
>>> ds.Size[filt_feb_aapl].nansum() / ds.Size.nansum()
0.08882445025331744
Next, for something completely different, we’ll explore ways to Concatenate Datasets.
Questions or comments about this guide? Email RiptableDocumentation@sig.com.