Solutions to Riptable Exercises

This notebook contains the solutions to the Riptable Exercises.

Your solutions may be implemented slightly differently, but they should get the same essential results.

If you have any questions or comments, email RiptableDocumentation@sig.com.

[1]:
import riptable as rt
import numpy as np

Introduction to the Riptable Dataset

Datasets are the core class of riptable.

They are tables of data, consisting of a series of columns of the same length (sometimes referred to as fields).

Structurally, they behave like python dictionaries, and can be created directly from one.

We’ll familiarize ourselves with Datasets by constructing one manually, generating fake sample data using np.random.default_rng().choice(...) or similar.

In real life they will essentially always be generated from real-world data.

First, create a python dictionary with two fields of the same length (>1000); one column of stock prices and one of symbols.

Make sure the symbols have duplicates, for later aggregation exercises.

[2]:
rng = np.random.default_rng()
dset_length = 5_000
[3]:
my_dict = {'Price': rng.uniform(0, 1000, dset_length), 'Symbol': rng.choice(['GME', 'AMZN', 'TSLA', 'SPY'], dset_length)}

Create a riptable dataset from this, using rt.Dataset(my_dict).

[4]:
my_dset = rt.Dataset(my_dict)

You can easily append more columns to a dataset.

Add a new column of integer trade size, using my_dset.Size =.

[5]:
my_dset.Size = rng.integers(1, 1000, dset_length)

Columns can also be referred to with brackets around a string name. This is typically used when the column name comes from a variable.

Add a new column of booleans indicating whether each trade was yours, using my_dset['MyTrade'] =.

[6]:
my_dset['MyTrade'] = rng.choice([True, False], dset_length)
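
For example (a small illustration, not one of the exercises), when the column name is held in a variable:

col_name = 'MyTrade'   # hypothetical variable holding a column name
my_dset[col_name]      # refers to the same column as my_dset.MyTrade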

Add a new column of strings, “Buy” or “Sell”, indicating the customer direction.

[7]:
my_dset.CustDirection = rng.choice(['Buy', 'Sell'], dset_length)

Riptable will convert these lists to the riptable FastArray container and cast the data to an appropriate numpy datatype.

View the datatypes with my_dset.dtypes.

[8]:
my_dset.dtypes
[8]:
{'Price': dtype('float64'),
 'Symbol': dtype('S4'),
 'Size': dtype('int64'),
 'MyTrade': dtype('bool'),
 'CustDirection': dtype('S4')}

View some sample rows of the dataset using .sample().

You should use this instead of .head() because the initial rows of a dataset are often unrepresentative.

[9]:
my_dset.sample()
[9]:
#   Price   Symbol  Size  MyTrade  CustDirection
0   294.84  SPY     18    False    Buy
1   80.15   TSLA    939   True     Buy
2   189.83  AMZN    919   True     Buy
3   795.83  SPY     324   True     Sell
4   111.57  AMZN    53    True     Buy
5   695.09  AMZN    173   True     Sell
6   109.21  AMZN    810   True     Buy
7   83.23   SPY     674   True     Sell
8   388.80  AMZN    872   False    Sell
9   744.19  SPY     164   False    Buy
[10 rows x 5 columns] total bytes: 250.0 B

View distributional stats of the numerical fields of your dataset with .describe().

You can call this on a single column as well.

[10]:
my_dset.describe()
[10]:
*Stats  Price    Size     MyTrade
Count   5000.00  5000.00  5000.00
Valid   5000.00  5000.00  5000.00
Nans    0.00     0.00     0.00
Mean    495.85   496.97   0.50
Std     292.47   288.05   0.50
Min     0.24     1.00     0.00
P10     92.07    97.00    0.00
P25     240.33   247.00   0.00
P50     493.73   499.00   0.00
P75     750.31   747.00   1.00
P90     902.58   895.00   1.00
Max     999.94   999.00   1.00
MeanM   494.98   497.04   0.50
[13 rows x 4 columns] total bytes: 377.0 B
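
For instance (a small illustration), the same stats for just the Price column:

my_dset.Price.describe()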

Manipulating data

You can perform simple operations on riptable columns with normal python syntax. Riptable will apply them to the whole column at once, efficiently.

Create a new column by performing scalar arithmetic on one of your numeric columns.

[11]:
my_dset.SharesOfStock = 100 * my_dset.Size
[12]:
my_dset.sample()
[12]:
#   Price   Symbol  Size  MyTrade  CustDirection  SharesOfStock
0   594.55  GME     428   False    Sell           42800
1   575.25  SPY     218   False    Buy            21800
2   88.07   TSLA    655   False    Sell           65500
3   616.62  SPY     941   True     Buy            94100
4   589.26  TSLA    235   False    Buy            23500
5   560.63  SPY     99    True     Sell           9900
6   461.22  SPY     982   True     Buy            98200
7   636.12  AMZN    96    False    Sell           9600
8   632.51  GME     253   True     Sell           25300
9   293.35  AMZN    104   False    Buy            10400
[10 rows x 6 columns] total bytes: 330.0 B

As long as the columns are the same size (as is guaranteed if they’re in the same dataset) you can perform combining operations the same way.

Create a new column of total price paid for the trade by multiplying two existing columns together.

Riptable will automatically upcast types as necessary to preserve information.

[13]:
my_dset.TotalCash = my_dset.Price * my_dset.Size
[14]:
my_dset.sample()
[14]:
#   Price   Symbol  Size  MyTrade  CustDirection  SharesOfStock  TotalCash
0   263.41  AMZN    533   False    Buy            53300          140395.27
1   6.94    SPY     111   True     Buy            11100          770.02
2   238.68  SPY     409   False    Sell           40900          97620.54
3   420.92  SPY     463   False    Buy            46300          194887.29
4   435.00  TSLA    676   False    Buy            67600          294057.25
5   947.84  TSLA    334   False    Buy            33400          316579.05
6   439.43  AMZN    364   False    Buy            36400          159952.95
7   44.15   AMZN    222   False    Buy            22200          9800.34
8   972.57  GME     243   False    Sell           24300          236334.91
9   873.75  TSLA    599   False    Sell           59900          523377.81
[10 rows x 7 columns] total bytes: 410.0 B

There are many built-in functions as well, which you call with either my_dset.field.function() or rt.function(my_dset.field) syntax.

Find the unique Symbols in your dataset.

[15]:
my_dset.Symbol.unique()
[15]:
FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4')
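
Equivalently, using the rt.function(...) syntax mentioned above (this sketch assumes the module-level rt.unique, which mirrors the method form):

rt.unique(my_dset.Symbol)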

Date/Time

Riptable has three main date/time types: Date, DateTimeNano, and TimeSpan.

Give each row of your dataset an rt.Date.

Make sure they’re not all different, but still include days from multiple months.

Note that due to Riptable idiosyncrasies you need to generate a list of yyyymmdd strings and pass it into the rt.Date(...) constructor, rather than constructing Dates individually.

[16]:
my_dset.Date = rt.Date(rng.choice(rt.Date.range('20220201', '20220430'), dset_length))
[17]:
my_dset.sample()
[17]:
#   Price   Symbol  Size  MyTrade  CustDirection  SharesOfStock  TotalCash  Date
0   28.77   SPY     452   False    Sell           45200          13004.17   2022-02-18
1   705.58  SPY     892   False    Sell           89200          629378.88  2022-02-17
2   225.44  SPY     221   True     Buy            22100          49821.42   2022-04-18
3   647.78  TSLA    488   True     Sell           48800          316118.91  2022-03-10
4   926.05  GME     426   False    Sell           42600          394497.53  2022-04-14
5   226.91  SPY     705   True     Sell           70500          159972.93  2022-04-27
6   816.29  SPY     400   True     Sell           40000          326516.95  2022-03-12
7   177.41  GME     257   True     Buy            25700          45594.88   2022-03-10
8   414.71  AMZN    232   False    Sell           23200          96213.22   2022-04-19
9   471.17  AMZN    576   False    Buy            57600          271393.67  2022-04-25
[10 rows x 8 columns] total bytes: 450.0 B

Give each row a unique(ish) TimeSpan as a trade time.

You can instantiate them using rt.TimeSpan(hours_var, unit='h').

[18]:
my_dset.TradeTime = rt.TimeSpan(rng.uniform(9.5, 16, dset_length), unit='h')
[19]:
my_dset.sample()
[19]:
#   Price   Symbol  Size  MyTrade  ...  SharesOfStock  TotalCash  Date        TradeTime
0   189.25  SPY     779   False    ...  77900          147423.20  2022-02-19  09:47:31.067694548
1   405.30  TSLA    575   True     ...  57500          233044.97  2022-02-28  13:47:22.412012377
2   306.29  SPY     12    True     ...  1200           3675.43    2022-03-26  12:56:47.572327046
3   740.90  SPY     872   False    ...  87200          646063.08  2022-02-20  11:16:31.322328844
4   148.72  AMZN    469   True     ...  46900          69751.19   2022-02-22  13:45:17.960767118
5   687.63  AMZN    900   False    ...  90000          618863.96  2022-02-16  13:03:38.246123166
6   395.53  AMZN    171   False    ...  17100          67635.54   2022-04-23  15:47:05.167722885
7   620.26  TSLA    777   False    ...  77700          481941.71  2022-04-14  11:40:32.776661126
8   266.59  GME     765   True     ...  76500          203941.39  2022-04-12  14:37:57.503811705
9   286.83  AMZN    613   False    ...  61300          175824.15  2022-03-03  14:45:09.378948295
[10 rows x 9 columns] total bytes: 530.0 B

Create a DateTimeNano of the combined TradeTime + Date by simple addition. Riptable knows how to add the two types.

Be careful here: by default you’ll get a GMT timezone; you can force NYC with rt.DateTimeNano(..., from_tz='NYC').

[20]:
my_dset.TradeDateTime = rt.DateTimeNano(my_dset.Date + my_dset.TradeTime, from_tz='NYC')
[21]:
my_dset.sample()
[21]:
#   Price   Symbol  Size  MyTrade  ...  TotalCash  Date        TradeTime           TradeDateTime
0   455.87  TSLA    397   True     ...  180979.30  2022-03-19  13:27:13.903568326  20220319 13:27:13.903568326
1   73.24   TSLA    887   False    ...  64965.26   2022-03-06  11:30:56.002432889  20220306 11:30:56.002432889
2   263.36  SPY     807   True     ...  212535.23  2022-02-19  13:34:55.264039702  20220219 13:34:55.264039702
3   88.49   GME     382   True     ...  33802.09   2022-03-01  14:31:17.048883242  20220301 14:31:17.048883242
4   889.71  AMZN    554   False    ...  492899.07  2022-03-08  11:12:06.558229570  20220308 11:12:06.558229570
5   466.92  TSLA    688   True     ...  321237.96  2022-04-26  09:59:53.558187626  20220426 09:59:53.558187626
6   165.87  AMZN    272   False    ...  45116.56   2022-03-10  13:41:53.830767087  20220310 13:41:53.830767087
7   740.90  SPY     872   False    ...  646063.08  2022-02-20  11:16:31.322328844  20220220 11:16:31.322328844
8   146.13  TSLA    896   True     ...  130931.03  2022-02-24  12:18:33.660870308  20220224 12:18:33.660870308
9   240.26  AMZN    587   True     ...  141032.79  2022-03-02  10:43:44.856459110  20220302 10:43:44.856459110
[10 rows x 10 columns] total bytes: 610.0 B

To reverse this operation and get out separate dates and times from a DateTimeNano, you can call rt.Date(my_DateTimeNano) and my_DateTimeNano.time_since_midnight().
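
As a minimal sketch of that reversal (using the two calls just mentioned, applied to the columns built above):

recovered_date = rt.Date(my_dset.TradeDateTime)                # the Date portion
recovered_time = my_dset.TradeDateTime.time_since_midnight()   # the time of day as a TimeSpan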

Create a new month name column by using the .strftime function.

[22]:
my_dset.month_name = my_dset.Date.strftime('%b%y')
[23]:
my_dset.sample()
[23]:
#   Price   Symbol  Size  MyTrade  ...  TradeTime           TradeDateTime                month_name
0   731.57  TSLA    687   True     ...  13:18:00.786792038  20220426 13:18:00.786792038  Apr22
1   491.23  TSLA    472   True     ...  12:13:02.001061236  20220313 12:13:02.001061236  Mar22
2   321.02  SPY     739   True     ...  10:05:47.468199835  20220428 10:05:47.468199835  Apr22
3   728.75  TSLA    191   True     ...  13:33:27.072606771  20220204 13:33:27.072606771  Feb22
4   414.23  AMZN    955   False    ...  12:21:10.042617283  20220209 12:21:10.042617283  Feb22
5   399.06  TSLA    608   True     ...  10:22:12.201005950  20220430 10:22:12.201005950  Apr22
6   918.26  AMZN    93    False    ...  12:32:03.015948824  20220227 12:32:03.015948824  Feb22
7   787.75  GME     408   True     ...  15:00:34.274163745  20220420 15:00:34.274163745  Apr22
8   478.57  TSLA    595   False    ...  13:17:26.624255585  20220211 13:17:26.624255585  Feb22
9   488.22  SPY     807   False    ...  13:45:28.888911507  20220228 13:45:28.888911507  Feb22
[10 rows x 11 columns] total bytes: 660.0 B

Create another new month column by using the .start_of_month attribute.

This is nice for grouping because it will automatically sort correctly.

[24]:
my_dset.month = my_dset.Date.start_of_month
[25]:
my_dset.sample()
[25]:
#   Price   Symbol  Size  MyTrade  ...  TradeDateTime                month_name  month
0   272.34  SPY     593   False    ...  20220421 15:56:06.447740285  Apr22       2022-04-01
1   758.55  SPY     210   True     ...  20220425 10:01:29.981701403  Apr22       2022-04-01
2   654.72  SPY     613   True     ...  20220330 09:38:12.459630380  Mar22       2022-03-01
3   417.83  TSLA    737   True     ...  20220426 15:29:01.508398022  Apr22       2022-04-01
4   280.31  AMZN    958   True     ...  20220327 11:43:36.813136697  Mar22       2022-03-01
5   139.19  AMZN    52    False    ...  20220418 13:50:10.245021957  Apr22       2022-04-01
6   915.93  TSLA    18    True     ...  20220420 13:55:37.175809652  Apr22       2022-04-01
7   877.00  TSLA    928   False    ...  20220223 11:51:05.923929754  Feb22       2022-02-01
8   61.40   AMZN    472   True     ...  20220311 09:40:49.323285896  Mar22       2022-03-01
9   807.30  SPY     863   True     ...  20220325 11:48:40.251197776  Mar22       2022-03-01
[10 rows x 12 columns] total bytes: 700.0 B

Sorting

Riptable has two sorts: sort_copy (which preserves the original dataset) and sort_inplace (which is faster and more memory-efficient if you don’t need the original data order).

Sort your dataset by TradeDateTime.

This is the natural ordering of a list of trades, so do it in-place.

[26]:
my_dset = my_dset.sort_inplace('TradeDateTime')
[27]:
my_dset.sample()
[27]:
#   Price   Symbol  Size  MyTrade  ...  TradeDateTime                month_name  month
0   379.21  AMZN    385   False    ...  20220203 10:58:19.673425054  Feb22       2022-02-01
1   617.50  SPY     413   True     ...  20220214 14:02:42.426523660  Feb22       2022-02-01
2   555.63  SPY     617   False    ...  20220308 13:25:17.282540250  Mar22       2022-03-01
3   718.59  AMZN    34    True     ...  20220324 11:57:46.925665938  Mar22       2022-03-01
4   838.81  GME     958   False    ...  20220330 10:09:37.013952547  Mar22       2022-03-01
5   498.17  AMZN    135   False    ...  20220403 12:36:29.924505469  Apr22       2022-04-01
6   123.59  AMZN    927   True     ...  20220414 12:51:26.312197854  Apr22       2022-04-01
7   339.86  AMZN    805   True     ...  20220417 13:05:59.168005100  Apr22       2022-04-01
8   158.91  SPY     211   False    ...  20220428 09:30:22.411907771  Apr22       2022-04-01
9   936.81  TSLA    81    False    ...  20220430 15:06:53.166066606  Apr22       2022-04-01
[10 rows x 12 columns] total bytes: 700.0 B
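
If you did need to preserve the original order, a sort_copy version might look like this (a sketch; it returns a new, sorted dataset and leaves my_dset untouched):

dset_by_price = my_dset.sort_copy('Price')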

Filtering

Filtering is the principal way to work with a subset of your data in riptable. It is commonly used for looking at a restricted set of trades matching some criterion you care about.

Except in rare instances, though, you should maintain your dataset in its full size, and only apply a filter when performing a final computation.

This will avoid unnecessary data duplication and improve speed & memory usage.

Construct a filter of only your sales. (A filter is a column of Booleans which is true only for the rows you’re interested in.)

You can combine filters using & or |. Be careful to always wrap expressions in parentheses to avoid an extremely slow call into native python followed by a crash.

Always (my_dset.field1 > 10) & (my_dset.field2 < 5), never my_dset.field1 > 10 & my_dset.field2 < 5.

[28]:
f_my_sales = my_dset.MyTrade & (my_dset.CustDirection == 'Buy')  # you traded, and the customer bought, so you sold
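
As an illustration of combining filters (not part of the exercise), a filter for trades that were either yours or customer sells could look like:

f_mine_or_cust_sells = my_dset.MyTrade | (my_dset.CustDirection == 'Sell')   # note the parentheses around the comparison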

Compute the total Trade Size, filtered for only your sales.

For this and many other instances, you can & should pass your filter into the filter kwarg of the .nansum(...) call.

This allows riptable to perform the filtering during the nansum computation, rather than instantiating a new column and then summing it.

[29]:
my_dset.Size.nansum(filter=f_my_sales)
[29]:
621241
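
For comparison only, the equivalent without the filter kwarg materializes a temporary filtered array first, which is slower and uses more memory:

my_dset.Size[f_my_sales].nansum()   # same result, but builds an intermediate array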

Count how many times you sold each symbol.

Here the .count() function doesn’t accept a filter kwarg, so you must fall back to explicitly filtering the Symbol field before counting.

Be careful that you only filter down the Symbol field, not the entire dataset; otherwise you are wasting a lot of compute.

[30]:
my_dset.Symbol[f_my_sales].count()
[30]:
*Unique  Count
AMZN     301
GME      306
SPY      282
TSLA     340
[4 rows x 2 columns] total bytes: 32.0 B

Categoricals

So far, we’ve been operating on your symbol column as a column of strings.

However, it’s far more efficient when you have a large column with many repeats to use a categorical, which assigns each unique value a number, and stores the labels & numbers separately.

This is memory-efficient, and also computationally efficient, as riptable can perform operations on the unique values, then expand out to the full vector appropriately.

Make a new column of your string column converted to a categorical, using rt.Cat(column).

[31]:
my_dset.Symbol_cat = rt.Cat(my_dset.Symbol)
my_dset.Symbol_cat
[31]:
Categorical([AMZN, SPY, SPY, SPY, SPY, ..., TSLA, GME, SPY, AMZN, SPY]) Length: 5000
  FastArray([1, 3, 3, 3, 3, ..., 4, 2, 3, 1, 3], dtype=int8) Base Index: 1
  FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4') Unique count: 4

Perform the same filtered count from above, on the categorical.

The categorical .count() admits a filter kwarg, which makes it simpler.

[32]:
my_dset.Symbol_cat.count(filter=f_my_sales)
[32]:
*Symbol_cat  Count
AMZN         301
GME          306
SPY          282
TSLA         340
[4 rows x 2 columns] total bytes: 32.0 B

Categoricals can be used as groupings. When you call a numeric function on a categorical and pass numeric columns in, riptable knows to do the calculation per-group.

Compute the total number of contracts sold by customers in each symbol.

[33]:
my_dset.Symbol_cat.sum(my_dset.Size, filter=my_dset.CustDirection == 'Sell')
[33]:
*Symbol_cat  Size
AMZN         303513
GME          290964
SPY          337699
TSLA         304961
[4 rows x 2 columns] total bytes: 48.0 B

The transform=True kwarg in a categorical operation performs the aggregation, then transforms it back up to the original shape of the categorical, giving each row the appropriate value from its group.

Make a new column which is the average trade price, per symbol.

[34]:
my_dset.average_trade_price = my_dset.Symbol_cat.mean(my_dset.Price, transform=True)

Inspect with .sample() to confirm that this value is consistent for rows with matching symbol.

[35]:
my_dset.sample()
[35]:
#   Price   Symbol  Size  MyTrade  ...  month_name  month       Symbol_cat  average_trade_price
0   612.27  AMZN    356   False    ...  Feb22       2022-02-01  AMZN        497.66
1   5.42    AMZN    610   False    ...  Feb22       2022-02-01  AMZN        497.66
2   877.96  AMZN    58    False    ...  Feb22       2022-02-01  AMZN        497.66
3   340.75  AMZN    802   True     ...  Mar22       2022-03-01  AMZN        497.66
4   564.53  GME     486   True     ...  Apr22       2022-04-01  GME         495.29
5   46.86   TSLA    414   True     ...  Apr22       2022-04-01  TSLA        499.91
6   850.28  SPY     723   True     ...  Apr22       2022-04-01  SPY         490.44
7   895.93  SPY     967   False    ...  Apr22       2022-04-01  SPY         490.44
8   267.21  AMZN    98    True     ...  Apr22       2022-04-01  AMZN        497.66
9   279.97  GME     887   True     ...  Apr22       2022-04-01  GME         495.29
[10 rows x 14 columns] total bytes: 806.0 B

If you need to perform a custom operation on each categorical, you can pass in a function with .apply_reduce (which aggregates) or .apply_nonreduce (which is like transform=True).

Note that the custom function you pass needs to expect a FastArray, and output a scalar (apply_reduce) or same-length FastArray (apply_nonreduce).

Find, for each symbol, the trade size of the second trade occurring in the dataset.

[36]:
my_dset.Symbol_cat.apply_reduce(lambda x: x[1], my_dset.Size)
[36]:
*Symbol_cat  Size
AMZN         700
GME          42
SPY          492
TSLA         536
[4 rows x 2 columns] total bytes: 48.0 B
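
As a sketch of the non-reducing variant (not required by the exercise; it assumes a per-symbol running total is the statistic you want), apply_nonreduce returns a full-length column aligned to the original rows:

my_dset.SymbolCumSize = my_dset.Symbol_cat.apply_nonreduce(lambda x: x.cumsum(), my_dset.Size)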

Sometimes you want to aggregate based on multiple values. In these cases we use multi-key categoricals.

Use a multi-key categorical to compute the average size per symbol-month pair.

[37]:
my_dset.Symbol_month_cat = rt.Cat([my_dset.Symbol, my_dset.month])
[38]:
my_dset.Symbol_month_cat.nanmean(my_dset.Size).sort_inplace('Symbol')
[38]:
*Symbol  *month      Size
AMZN     2022-02-01  473.99
.        2022-03-01  495.32
.        2022-04-01  493.56
GME      2022-02-01  508.95
.        2022-03-01  506.99
.        2022-04-01  479.18
SPY      2022-02-01  509.76
.        2022-03-01  529.28
.        2022-04-01  479.58
TSLA     2022-02-01  501.33
.        2022-03-01  469.43
.        2022-04-01  517.81
[12 rows x 3 columns] total bytes: 192.0 B

Accumulating

Aggregating over two values for human viewing is often most conveniently done with an accum.

Use Accum2 to compute the average size per symbol-month pair.

[39]:
rt.Accum2(my_dset.Symbol, my_dset.month).nanmean(my_dset.Size)
[39]:
*Symbol  2022-02-01  2022-03-01  2022-04-01  Nanmean
AMZN     473.99      495.32      493.56      487.57
GME      508.95      506.99      479.18      498.34
SPY      509.76      529.28      479.58      506.63
TSLA     501.33      469.43      517.81      495.67
Nanmean  497.85      499.98      492.81      496.97
[4 rows x 5 columns] total bytes: 144.0 B

Average numbers can be meaningless. It is often better to consider relative percentages instead.

Use accum_ratiop to compute the fraction of total volume done by each symbol-month pair.

[40]:
rt.accum_ratiop(my_dset.Symbol, my_dset.month, my_dset.Size, norm_by='R')
[40]:
*Symbol     2022-02-01  2022-03-01  2022-04-01  TotalRatio  Total
AMZN        32.93       36.70       30.37       100.00      628959
GME         31.96       36.02       32.02       100.00      594021
SPY         32.93       36.10       30.98       100.00      636328
TSLA        31.90       33.17       34.93       100.00      625533
TotalRatio  32.44       35.49       32.07       100.00
Total       806012      881963      796866                  2484841
[4 rows x 6 columns] total bytes: 176.0 B

Merging

There are two main types of merges.

First is merge_lookup. This is used for enriching one (typically large) dataset with information from another (typically small) dataset.

Create a new dataset with one row per symbol from your dataset, and a second column of who trades each symbol.

[41]:
symbol_trader = rt.Dataset({'UnderlyingSymbol': ['GME', 'TSLA', 'SPY', 'AMZN'],
                           'Trader': ['Nate', 'Elon', 'Josh', 'Dan']})
[42]:
symbol_trader
[42]:
#  UnderlyingSymbol  Trader
0  GME               Nate
1  TSLA              Elon
2  SPY               Josh
3  AMZN              Dan
[4 rows x 2 columns] total bytes: 32.0 B

Enrich the main dataset by putting the correct trader into each row.

[43]:
my_dset.Trader = my_dset.merge_lookup(symbol_trader, on=('Symbol', 'UnderlyingSymbol'), columns_left=[])['Trader']
[44]:
my_dset.sample()
[44]:
#   Price   Symbol  Size  MyTrade  ...  Symbol_cat  average_trade_price  Symbol_month_cat    Trader
0   702.44  SPY     739   True     ...  SPY         490.44               (SPY, 2022-02-01)   Josh
1   926.83  TSLA    591   True     ...  TSLA        499.91               (TSLA, 2022-02-01)  Elon
2   450.66  SPY     459   False    ...  SPY         490.44               (SPY, 2022-02-01)   Josh
3   664.47  SPY     846   True     ...  SPY         490.44               (SPY, 2022-03-01)   Josh
4   464.46  AMZN    379   True     ...  AMZN        497.66               (AMZN, 2022-03-01)  Dan
5   508.80  GME     907   False    ...  GME         495.29               (GME, 2022-03-01)   Nate
6   145.61  TSLA    289   True     ...  TSLA        499.91               (TSLA, 2022-04-01)  Elon
7   729.66  GME     148   False    ...  GME         495.29               (GME, 2022-04-01)   Nate
8   20.50   TSLA    769   True     ...  TSLA        499.91               (TSLA, 2022-04-01)  Elon
9   768.59  GME     957   True     ...  GME         495.29               (GME, 2022-04-01)   Nate
[10 rows x 16 columns] total bytes: 952.0 B

The second type of merge is merge_asof, which is used for fuzzy alignment between two datasets, typically by time (though sometimes by other variables).

Create a new index price dataset with one price per minute, which covers all the Dates in your dataset.

The index price doesn’t need to be reasonable.

Each row should have a DateTimeNano as the datetime.

[45]:
num_minutes = int((my_dset.TradeDateTime.max() - my_dset.TradeDateTime.min()).minutes[0])
start_datetime = rt.Date(my_dset.TradeDateTime.min())
[46]:
index_price = rt.Dataset({'DateTime': start_datetime + rt.TimeSpan(range(num_minutes), unit='m'),
                          'IndexPrice': rng.uniform(3500, 4500, num_minutes)})
[47]:
index_price.sample()
[47]:
#   DateTime                     IndexPrice
0   20220217 07:25:00.000000000  3742.56
1   20220218 12:24:00.000000000  4439.16
2   20220225 16:41:00.000000000  3833.25
3   20220303 13:44:00.000000000  4341.40
4   20220326 08:00:00.000000000  4356.62
5   20220402 02:58:00.000000000  3796.68
6   20220403 15:55:00.000000000  3645.95
7   20220416 10:01:00.000000000  4469.10
8   20220423 03:30:00.000000000  4284.35
9   20220427 08:09:00.000000000  4347.81
[10 rows x 2 columns] total bytes: 160.0 B

Use merge_asof to get the most recent Index Price associated with each trade in your main dataset.

Note both datasets need to be sorted for merge_asof.

The on kwarg names the numeric/time field used to find close matches.

The by kwarg is not necessary here, but could constrain the match to a subset if, for example, you had multiple indices and a column of which one each row is associated with.

Use direction='backward' to ensure you’re not biasing your data by looking into the future!

[48]:
my_dset.IndexPrice = my_dset.merge_asof(index_price, on=('TradeDateTime', 'DateTime'), direction='backward', columns_left=[])['IndexPrice']

Saving/Loading

The native riptable filetype is .sds. It’s the fastest way to save & load your data.

Save out your dataset to file using rt.save_sds.

[49]:
rt.save_sds('my_dset.sds', my_dset)

Delete your dataset to free up memory using the native python del my_dset.

Note that if there are references to the dataset in other objects you may not actually free up memory.

[50]:
del my_dset

Reload your saved dataset from disk with rt.load_sds.

[51]:
my_dset = rt.load_sds('my_dset.sds')
[52]:
my_dset.sample()
[52]:
#   Price   Symbol  Size  MyTrade  ...  average_trade_price  Symbol_month_cat    Trader  IndexPrice
0   569.10  TSLA    719   True     ...  499.91               (TSLA, 2022-02-01)  Elon    3945.79
1   915.31  GME     4     False    ...  495.29               (GME, 2022-02-01)   Nate    3756.71
2   173.42  AMZN    166   False    ...  497.66               (AMZN, 2022-03-01)  Dan     3972.01
3   410.27  GME     722   True     ...  495.29               (GME, 2022-03-01)   Nate    4458.17
4   606.42  SPY     995   True     ...  490.44               (SPY, 2022-03-01)   Josh    3954.01
5   910.27  AMZN    219   True     ...  497.66               (AMZN, 2022-03-01)  Dan     4108.60
6   609.56  TSLA    459   False    ...  499.91               (TSLA, 2022-04-01)  Elon    4384.97
7   466.54  TSLA    400   False    ...  499.91               (TSLA, 2022-04-01)  Elon    4221.02
8   225.37  AMZN    150   False    ...  497.66               (AMZN, 2022-04-01)  Dan     3688.96
9   912.44  AMZN    615   True     ...  497.66               (AMZN, 2022-04-01)  Dan     3527.44
[10 rows x 17 columns] total bytes: 1.0 KB

To load from h5 files (a common file type at SIG), use rt.load_h5(file).

To load from csv files, use the slow but robust pandas loader, with rt.Dataset.from_pandas(pd.read_csv(file)).
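
A minimal sketch of both loaders (the file names here are hypothetical):

# import pandas as pd
# dset_from_h5 = rt.load_h5('my_data.h5')                              # hypothetical .h5 file
# dset_from_csv = rt.Dataset.from_pandas(pd.read_csv('my_data.csv'))   # hypothetical .csv file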