Riptable Exercises

This workbook is meant to give practical experience with the key ideas & functionality of Riptable.

To complete it, you’ll need to consult the Intro to Riptable.

Depending on your preferred learning style, you can read through the entire intro guide first or start with the exercises and refer to the guide as needed.

Note that the intro guide has more coverage and detail, so it’s well worth reading in full at some point.

If you have any questions or comments, email RiptableDocumentation@sig.com.

[1]:
import riptable as rt
import numpy as np

Introduction to the Riptable Dataset

Datasets are the core class of riptable.

They are tables of data, consisting of a series of columns of the same length (sometimes referred to as fields).

Structurally, they behave like Python dictionaries, and can be created directly from one.

We’ll familiarize ourselves with Datasets by constructing one manually, generating fake sample data with np.random.default_rng().choice(...) or similar.

In real life they will essentially always be generated from real-world data.

First, create a Python dictionary with two fields of the same length (>1000): one column of stock prices and one of symbols.

Make sure the symbols have duplicates, for later aggregation exercises.

[ ]:

[ ]:
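One possible sketch, with hypothetical tickers and column names (later sketches reuse rng and n):

rng = np.random.default_rng()
n = 2000
my_dict = {
    'Price': rng.uniform(10, 500, n).round(2),  # fake stock prices
    'Symbol': rng.choice(['AAPL', 'MSFT', 'TSLA', 'SPY', 'AMZN'], n),  # repeats guaranteed
}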

Create a riptable dataset from this, using rt.Dataset(my_dict).

[ ]:

You can easily append more columns to a dataset.

Add a new column of integer trade size, using my_dset.Size =.

[ ]:
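For instance, with arbitrary sizes:

my_dset.Size = rng.integers(1, 1000, n)  # integer trade sizes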

Columns can also be referred to with brackets around a string name. This is typically used when the column name comes from a variable.

Add a new column of booleans indicating whether you traded this trade, using my_dset['MyTrade'] =.

[ ]:
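Something like:

my_dset['MyTrade'] = rng.choice([True, False], n)  # did you trade this trade?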

Add a new column of strings, “Buy” or “Sell”, indicating the customer direction.

[ ]:
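One way, calling the column Side (the name is just a choice for these sketches):

my_dset.Side = rng.choice(['Buy', 'Sell'], n)  # customer direction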

Riptable will convert these lists to the riptable FastArray container and cast the data to an appropriate numpy datatype.

View the datatypes with my_dset.dtypes.

[ ]:

View some sample rows of the dataset using .sample().

You should use this instead of .head() because the initial rows of a dataset are often unrepresentative.

[ ]:

View distributional stats of the numerical fields of your dataset with .describe().

You can call this on a single column as well.

[ ]:

Manipulating data

You can perform simple operations on riptable columns with normal Python syntax. Riptable will apply them to the whole column at once, efficiently.

Create a new column by performing scalar arithmetic on one of your numeric columns.

[ ]:

[ ]:
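For instance, converting dollars to cents:

my_dset.PriceCents = my_dset.Price * 100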

As long as the columns are the same size (as is guaranteed if they’re in the same dataset) you can perform combining operations the same way.

Create a new column of total price paid for the trade by multiplying two existing columns together.

Riptable will automatically upcast types as necessary to preserve information.

[ ]:

[ ]:
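One possibility, assuming the Price and Size columns from the sketches above:

my_dset.Notional = my_dset.Price * my_dset.Size  # total price paid per trade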

There are many built-in functions as well, which you call with either my_dset.field.function() or rt.function(my_dset.field) syntax.

Find the unique Symbols in your dataset.

[ ]:
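Either spelling should work:

my_dset.Symbol.unique()
rt.unique(my_dset.Symbol)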

Date/Time

Riptable has three main date/time types: Date, DateTimeNano, and TimeSpan.

Give each row of your dataset an rt.Date.

Make sure they’re not all different, but still include days from multiple months.

Note that due to Riptable idiosyncrasies you need to generate a list of yyyymmdd strings and pass it into the rt.Date(...) constructor, rather than constructing Dates individually.

[ ]:

[ ]:
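A sketch following the note above: a handful of repeated dates spanning two months.

date_strings = rng.choice(['20240102', '20240117', '20240205', '20240220'], n)
my_dset.Date = rt.Date(date_strings)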

Give each row a unique(ish) TimeSpan as a trade time.

You can instantiate them using rt.TimeSpan(hours_var, unit='h').

[ ]:

[ ]:
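For example, random times during the trading day:

my_dset.TradeTime = rt.TimeSpan(rng.uniform(9.5, 16, n), unit='h')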

Create a DateTimeNano of the combined TradeTime + Date by simple addition. Riptable knows how to add the two types.

Be careful here: by default you’ll get the GMT timezone; you can force NYC with rt.DateTimeNano(..., from_tz='NYC').

[ ]:

[ ]:
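A sketch, using the constructor from the note above:

my_dset.TradeDateTime = rt.DateTimeNano(my_dset.Date + my_dset.TradeTime, from_tz='NYC')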

To reverse this operation and get out separate dates and times from a DateTimeNano, you can call rt.Date(my_DateTimeNano) and my_DateTimeNano.time_since_midnight().

Create a new month name column by using the .strftime function.

[ ]:

[ ]:
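For example ('%B' is the standard strftime code for the full month name):

my_dset.MonthName = my_dset.Date.strftime('%B')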

Create another new month column by using the .start_of_month attribute.

This is nice for grouping because it will automatically sort correctly.

[ ]:

[ ]:
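Something like (later sketches reuse Month):

my_dset.Month = my_dset.Date.start_of_month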

Sorting

Riptable has two sorts: sort_copy, which preserves the original dataset, and sort_inplace, which is faster and more memory-efficient if you don’t need the original data order.

Sort your dataset by TradeDateTime.

This is the natural ordering of a list of trades, so do it in-place.

[ ]:

[ ]:
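For example:

my_dset.sort_inplace('TradeDateTime')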

Filtering

Filtering is the principal way to work with a subset of your data in riptable. It is commonly used for looking at a restricted set of trades matching some criterion you care about.

Except in rare instances, though, you should keep your dataset at its full size, and only apply a filter when performing a final computation.

This will avoid unnecessary data duplication and improve speed & memory usage.

Construct a filter of only your sales. (A filter is a column of Booleans which is true only for the rows you’re interested in.)

You can combine filters using & or |. Be careful to always wrap expressions in parentheses to avoid an extremely slow call into native Python followed by a crash.

Always (my_dset.field1 > 10) & (my_dset.field2 < 5), never my_dset.field1 > 10 & my_dset.field2 < 5.

[ ]:
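One possible filter, assuming the MyTrade and Side columns from earlier (you sold when you traded and the customer bought):

sold_filter = my_dset.MyTrade & (my_dset.Side == 'Buy')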

Compute the total Trade Size, filtered for only your sales.

For this and many other instances, you can & should pass your filter into the filter kwarg of the .nansum(...) call.

This allows riptable to perform the filtering during the nansum computation, rather than instantiating a new column and then summing it.

[ ]:
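For example:

my_dset.Size.nansum(filter=sold_filter)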

Count how many times you sold each symbol.

Here the .count() function doesn’t accept a filter kwarg, so you must fall back to explicitly filtering the Symbol field before counting.

Be careful that you only filter down the Symbol field, not the entire dataset, otherwise you are wasting a lot of compute.

[ ]:
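A sketch; note that only the Symbol column is filtered, not the whole dataset:

my_dset.Symbol[sold_filter].count()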

Categoricals

So far, we’ve been operating on your symbol column as a column of strings.

However, when you have a large column with many repeated values, it’s far more efficient to use a categorical, which assigns each unique value a number, and stores the labels & numbers separately.

This is memory-efficient, and also computationally efficient, as riptable can perform operations on the unique values, then expand the results out to the full vector appropriately.

Make a new column of your string column converted to a categorical, using rt.Cat(column).

[ ]:
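For example (later sketches reuse SymCat):

my_dset.SymCat = rt.Cat(my_dset.Symbol)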

Perform the same filtered count from above, on the categorical.

The categorical .count() admits a filter kwarg, which makes it simpler.

[ ]:
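Something like:

my_dset.SymCat.count(filter=sold_filter)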

Categoricals can be used as groupings. When you call a numeric function on a categorical and pass numeric columns in, riptable knows to do the calculation per-group.

Compute the total amount of contracts sold by customers in each symbol.

[ ]:
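One possibility, assuming customer sells are the rows with Side == 'Sell':

my_dset.SymCat.nansum(my_dset.Size, filter=my_dset.Side == 'Sell')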

The transform=True kwarg in a categorical operation performs the aggregation, then expands the result back to the original shape of the categorical, giving each row the appropriate value from its group.

Make a new column which is the average trade price, per symbol.

[ ]:
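A sketch; the aggregation comes back as a dataset whose column should be named after the input column:

result = my_dset.SymCat.nanmean(my_dset.Price, transform=True)
my_dset.SymAvgPrice = result.Price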

Inspect with .sample() to confirm that this value is consistent for rows with matching symbol.

[ ]:

If you need to perform a custom operation on each group of a categorical, you can pass a function into .apply_reduce (which aggregates) or .apply_nonreduce (which acts like transform=True).

Note that the custom function you pass needs to expect a FastArray, and output a scalar (apply_reduce) or same-length FastArray (apply_nonreduce).

Find, for each symbol, the trade size of the second trade occurring in the dataset.

[ ]:
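A sketch, assuming every symbol has at least two trades (index 1 is the second trade, since the dataset is sorted by trade time):

my_dset.SymCat.apply_reduce(lambda sizes: sizes[1], my_dset.Size)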

Sometimes you want to aggregate based on multiple values. In these cases we use multi-key categoricals.

Use a multi-key categorical to compute the average size per symbol-month pair.

[ ]:

[ ]:
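One possibility, assuming the Month column from the Date/Time section:

sym_month = rt.Cat([my_dset.Symbol, my_dset.Month])
sym_month.nanmean(my_dset.Size)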

Accumulating

Aggregating over two values for human viewing is often most conveniently done with an accum.

Use Accum2 to compute the average size per symbol-month pair.

[ ]:
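For example:

rt.Accum2(my_dset.Symbol, my_dset.Month).nanmean(my_dset.Size)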

Average numbers can be meaningless. It is often better to consider relative percentages instead.

Use accum_ratiop to compute the fraction of total volume done by each symbol-month pair.

[ ]:
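A sketch; norm_by='T' (normalizing by the grand total) is an assumption worth checking against the docs:

rt.accum_ratiop(my_dset.Symbol, my_dset.Month, my_dset.Size, norm_by='T')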

Merging

There are two main types of merges.

First is merge_lookup. This is used for enriching one (typically large) dataset with information from another (typically small) dataset.

Create a new dataset with one row per symbol from your dataset, and a second column of who trades each symbol.

[ ]:

[ ]:
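One possible lookup table, with hypothetical trader names:

symbols = my_dset.Symbol.unique()
traders = rt.Dataset({
    'Symbol': symbols,
    'Trader': rng.choice(['Ann', 'Bob', 'Cal'], len(symbols)),
})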

Enrich the main dataset by putting the correct trader into each row.

[ ]:

[ ]:
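Something like:

my_dset = rt.merge_lookup(my_dset, traders, on='Symbol')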

The second type of merge is merge_asof, which is used for fuzzy alignment between two datasets, typically by time (though sometimes by other variables).

Create a new index price dataset with one price per minute, which covers all the Dates in your dataset.

The index price doesn’t need to be reasonable.

Each row should have a DateTimeNano as the datetime.

[ ]:

[ ]:

[ ]:
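A sketch: repeat each date once per minute of the day, then add minute offsets (prices are arbitrary, and unit='m' for minutes is assumed by analogy with unit='h'):

u_dates = my_dset.Date.unique()
days = rt.Date(np.repeat(u_dates, 24 * 60))  # each date, once per minute
mins = rt.TimeSpan(np.tile(np.arange(24 * 60), len(u_dates)), unit='m')
index_ds = rt.Dataset({
    'DateTime': rt.DateTimeNano(days + mins, from_tz='NYC'),
    'IndexPrice': rng.uniform(4000, 5000, len(days)).round(2),
})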

Use merge_asof to get the most recent Index Price associated with each trade in your main dataset.

Note both datasets need to be sorted for merge_asof.

The on kwarg is the numeric/time field that looks for close matches.

The by kwarg is not necessary here, but could constrain the match to a subset if, for example, you had multiple indices and a column of which one each row is associated with.

Use direction='backward' to ensure you’re not biasing your data by looking into the future!

[ ]:
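One possibility, assuming the index_ds sketch above (both sides sorted on their time fields first):

my_dset.sort_inplace('TradeDateTime')
index_ds.sort_inplace('DateTime')
my_dset = rt.merge_asof(my_dset, index_ds,
                        left_on='TradeDateTime', right_on='DateTime',
                        direction='backward')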

Saving/Loading

The native riptable filetype is .sds. It’s the fastest way to save & load your data.

Save out your dataset to file using rt.save_sds.

[ ]:
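For example (the path is hypothetical):

rt.save_sds('my_trades.sds', my_dset)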

Delete your dataset to free up memory, using native Python’s del my_dset.

Note that if there are references to the dataset in other objects you may not actually free up memory.

[ ]:

Reload your saved dataset from disk with rt.load_sds.

[ ]:

[ ]:
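For example:

my_dset = rt.load_sds('my_trades.sds')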

To load from h5 files (a common file type at SIG), use rt.load_h5(file).

To load from csv files, use the slow but robust pandas loader, with rt.Dataset.from_pandas(pd.read_csv(file)).