Get and Operate on Subsets of Data Using Filters
Earlier, we used indexing and slicing to select data. You can also use filters to get data that meets a certain condition.
Datasets and FastArrays have a filter()
method that returns the subset
of data that meets a given condition. But to operate on that data, it’s often
better to pass a filter
keyword argument to the method you’re using.
This section covers:
How to create conditions for filtering – specifically, how comparison operators create mask arrays that can be used to filter
What to expect when you filter a FastArray or a Dataset
How string operations can be used to create filters
How to create more complex filters using logic operators
How to replace values using filters and
rt.where
How to operate on filtered Datasets in a memory-efficient way
Comparison Operators and Mask Arrays
When used to compare scalar values, comparison operators
(>
, <
, >=
, <=
, ==
, !=
) return True or False.
>>> x = 10
>>> x == 10 # equal
True
>>> x != 12 # not equal
True
>>> x > 12 # greater than
False
In NumPy and Riptable, comparison operators are ufuncs, which means they can be used to compare arrays.
When an array is compared element-wise with a scalar value or a same-length array, the result is an array of Booleans.
>>> a = rt.FastArray([1, 2, 3, 4, 5])
>>> b = rt.FastArray([0, 5, 2, 4, 8])
>>> a > 3
FastArray([False, False, False, True, True])
>>> a <= b
FastArray([False, True, False, True, True])
These Boolean arrays can be used to filter data. In this context, they’re often called Boolean mask arrays.
For the FastArray and Dataset filter()
methods, you can pass a Boolean mask
array directly or pass a comparison (or other operation) that results in a mask
array. Here, we’ll focus on the various ways to generate mask arrays based on
comparisons and other conditions.
Filter a FastArray with a Comparison
Above, we compared two FastArrays, a
and b
, using the condition
a <= b
to create a Boolean mask array.
To filter a
based on that
condition (that is, to show only the values of a
for which a <= b
is
True), use the FastArray filter()
method with the condition.
>>> f = a <= b
>>> a.filter(f)
FastArray([2, 4, 5])
Note that the returned FastArray is a copy; the original is unchanged:
>>> a
FastArray([1, 2, 3, 4, 5])
Filter a Dataset with a Comparison
Datasets also have a filter()
method. It returns a copy of the Dataset with
only the rows that meet the desired condition.
We’ll work with this Dataset:
>>> ds = rt.Dataset({
... 'OSISymbol': ['VIX:200520:35:0:P', 'AAPL:200417:255:0:P', 'LITE:200619:82:5:P',
... 'SPY:200406:265:0:C', 'MFA:200515:2:0:C', 'XOM:220121:60:0:C',
... 'CCL:200717:12:5:C', 'AXSM:200515:85:0:C', 'UBER:200515:33:0:C',
... 'TLT:200529:165:0:P'],
... 'UnderlyingSymbol': ['VIX', 'AAPL', 'LITE', 'SPY', 'MFA', 'XOM', 'CCL', 'AXSM',
... 'UBER', 'TLT'],
... 'TradeDate': rt.Date(['2020-03-03', '2020-03-19', '2020-03-24', '2020-04-06',
... '2020-04-20', '2020-04-23', '2020-04-27', '2020-05-01',
... '2020-05-13', '2020-05-26']),
... 'TradeSize': [3., 1., 5., 50., 10., 5., 1., 6., 3., 1.],
... 'TradePrice': [13.4, 27.5, 14.8, 0.14, 0.29, 3.75, 2.55, 7.79, 0.77, 1.78 ],
... 'OptionType': ['P', 'P', 'P', 'C', 'C', 'C', 'C', 'C', 'C', 'P'],
... 'Traded': [False, False, True, False, True, True, False, True, True, False]
... })
>>> ds
# OSISymbol UnderlyingSymbol TradeDate TradeSize TradePrice OptionType Traded
- --------------- ---------------- ---------- --------- ---------- ---------- ------
0 VIX:200520:35:0 VIX 2020-03-03 3.00 13.40 P False
1 AAPL:200417:255 AAPL 2020-03-19 1.00 27.50 P False
2 LITE:200619:82: LITE 2020-03-24 5.00 14.80 P True
3 SPY:200406:265: SPY 2020-04-06 50.00 0.14 C False
4 MFA:200515:2:0: MFA 2020-04-20 10.00 0.29 C True
5 XOM:220121:60:0 XOM 2020-04-23 5.00 3.75 C True
6 CCL:200717:12:5 CCL 2020-04-27 1.00 2.55 C False
7 AXSM:200515:85: AXSM 2020-05-01 6.00 7.79 C True
8 UBER:200515:33: UBER 2020-05-13 3.00 0.77 C True
9 TLT:200529:165: TLT 2020-05-26 1.00 1.78 P False
Say we want to see only the rows with options that are puts.
The syntax is the same as for FastArrays:
>>> f = ds.OptionType == 'P'
>>> ds.filter(f)
# OSISymbol UnderlyingSymbol TradeDate TradeSize TradePrice OptionType Traded
- --------------- ---------------- ---------- --------- ---------- ---------- ------
0 VIX:200520:35:0 VIX 2020-03-03 3.00 13.40 P False
1 AAPL:200417:255 AAPL 2020-03-19 1.00 27.50 P False
2 LITE:200619:82: LITE 2020-03-24 5.00 14.80 P True
3 TLT:200529:165: TLT 2020-05-26 1.00 1.78 P False
By default all columns are returned. If you want to return only certain columns, you can combine the mask array with column selection:
>>> ds.filter(f).col_filter(['OSISymbol', 'TradeSize'])
Alternatively, you can use the syntax we used to select Dataset rows to select rows based on the filter, along with the columns you want:
>>> ds[f, [0, 3]]
# OSISymbol TradeSize
- --------------- ---------
0 VIX:200520:35:0 3.00
1 AAPL:200417:255 1.00
2 LITE:200619:82: 5.00
3 TLT:200529:165: 1.00
Here it could also make sense to pass the Traded column directly as a mask array:
>>> ds.filter(ds.Traded)
# OSISymbol UnderlyingSymbol TradeDate TradeSize TradePrice OptionType Traded
- --------------- ---------------- ---------- --------- ---------- ---------- ------
0 LITE:200619:82: LITE 2020-03-24 5.00 14.80 P True
1 MFA:200515:2:0: MFA 2020-04-20 10.00 0.29 C True
2 XOM:220121:60:0 XOM 2020-04-23 5.00 3.75 C True
3 AXSM:200515:85: AXSM 2020-05-01 6.00 7.79 C True
4 UBER:200515:33: UBER 2020-05-13 3.00 0.77 C True
Note: Keep in mind that every time you use filter()
, it makes a copy of
the Dataset that takes up memory. We cover a couple of strategies for minimizing
memory use below, when we talk about operations on filtered data.
Use FastArray String Methods to Create Filters
FastArray string methods are useful for creating conditions you can use to filter.
Create a filter for OSISymbol strings that start with ‘A’:
>>> f = ds.OSISymbol.str.startswith('A')
>>> f
FastArray([False, True, False, False, False, False, False, True, False, False])
For OSISymbol strings that contain the substring ‘2005’:
>>> f = ds.OSISymbol.str.contains('2005')
>>> f
FastArray([ True, False, False, False, True, False, False, True, True, True])
For UnderlyingSymbol strings that end with ‘L’:
>>> f = ds.UnderlyingSymbol.str.regex_match('L$')
>>> f
FastArray([False, True, False, False, False, False, True, False, False, False])
Create More Complex Boolean Mask Filters with Bitwise Logic Operators (&
, |
, ~
)
You can build more complex filters using Python’s bitwise logic
operators, &
(bitwise and), |
(bitwise or), and ~
(bitwise
not).
Let’s say you want to construct a filter that returns True for calls
over $2.00. You can use &
to ensure that both of those conditions
are met:
>>> callsover2 = (ds.OptionType == 'C') & (ds.TradePrice > 2.00)
>>> callsover2
FastArray([False, False, False, False, False, True, True, True, False, False])
Warning: When you use bitwise logic operators, always wrap the
expressions on either side in parentheses (as above) to make sure
they’re evaluated in the right order. Without the parentheses, operator
precedence rules would cause the expression above to be evaluated as
ds.OptionType == ('C' & ds.TradePrice) > 2.00
, which would result in
an extremely slow call into native Python, followed by a crash. Also
note that the Python keywords AND
, OR
, and NOT
do not work
with Boolean arrays. Use &
, |
, or ~
instead.
More examples of filter combinations:
>>> # Define two filters
>>> f1 = (ds.TradeSize <= 3.00)
>>> f2 = (ds.TradePrice > 3.00)
True if both are True:
>>> f1 & f2
FastArray([ True, True, False, False, False, False, False, False, False, False])
True if either one is True:
>>> f1 | f2
FastArray([ True, True, True, False, False, True, True, True, True, True])
The negation of the f1
filter:
>>> ~f1
FastArray([False, False, True, True, True, True, False, True, False, False])
If you have complex filter criteria you want to reuse, assigning variable names to your filters can make things easier. You can also store your filters in a Riptable Struct:
>>> s = rt.Struct()
>>> s.ds = ds
>>> s.callsover2 = callsover2
>>> s
# Name Type Size 0 1 2
- ---------- ------- ---------------- ----- ----- -----
0 ds Dataset 10 rows x 7 cols
1 callsover2 bool 10 False False False
Set Values in Columns with Masks and rt.where()
You can also use mask arrays to update values that meet the filter condition.
Note, though, that the values are updated in place, not copied!
Suppose you want to update all the puts to be marked as traded. The FastArray
filter()
method doesn’t let you set new values, but you can use the following
syntax:
>>> f = ds.OptionType == 'P'
>>> ds.Traded[f] = True
>>> ds
# OSISymbol UnderlyingSymbol TradeDate TradeSize TradePrice OptionType Traded
- --------------- ---------------- ---------- --------- ---------- ---------- ------
0 VIX:200520:35:0 VIX 2020-03-03 3.00 13.40 P True
1 AAPL:200417:255 AAPL 2020-03-19 1.00 27.50 P True
2 LITE:200619:82: LITE 2020-03-24 5.00 14.80 P True
3 SPY:200406:265: SPY 2020-04-06 50.00 0.14 C False
4 MFA:200515:2:0: MFA 2020-04-20 10.00 0.29 C True
5 XOM:220121:60:0 XOM 2020-04-23 5.00 3.75 C True
6 CCL:200717:12:5 CCL 2020-04-27 1.00 2.55 C False
7 AXSM:200515:85: AXSM 2020-05-01 6.00 7.79 C True
8 UBER:200515:33: UBER 2020-05-13 3.00 0.77 C True
9 TLT:200529:165: TLT 2020-05-26 1.00 1.78 P True
What if you want to provide one value where the mask is True and a different value where the mask is False?
rt.where()
is a function that works as an if-then-else procedure.
It takes three arguments:
condition
x
y
Where the condition is met, it returns x
; otherwise, it returns y
. (If
x
or y
is an array, the value that corresponds to the True or False is
used.)
Here, for instance, rt.where
returns a
where a < 5
; otherwise
it returns 10 * a
:
>>> a = rt.FA([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> rt.where(a < 5, a, 10 * a)
FastArray([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
In the Dataset above, we can have rt.where()
mark puts as traded and calls as
not traded. Note that rt.where()
returns a FastArray, so the result needs to
be assigned as a Dataset column.
>>> ds.Traded = rt.where(ds.OptionType == 'P', True, False)
>>> ds[['OptionType', 'Traded']]
# OptionType Traded
- ---------- ------
0 P True
1 P True
2 P True
3 C False
4 C False
5 C False
6 C False
7 C False
8 C False
9 P True
Operate on Filtered Data in a Dataset
Looking at filtered data can provide some useful insights. But often, you want to operate on it.
Say you want to compute the total size of options that were traded.
Given that we just covered filter()
, you might be tempted to do
this:
>>> ds.filter(ds.Traded).TradeSize.nansum()
303.0
However, remember that filter()
returns a copy of the Dataset,
filtered by the mask array. This is unnecessary here – we’re only
interested in the subset of one column of data. Fortunately, there are a
couple of ways to work only on the data we need.
We can pass a filter argument to nansum()
with the Boolean array
contained in ds.Traded
:
>>> ds.TradeSize.nansum(filter=ds.Traded)
303.0
This gets the sum of only the values in the TradeSize column that meet the filter criteria.
Note that the filter=
is needed here – if you just pass the Boolean
array by itself, the array will be silently ignored:
>>> ds.TradeSize.nansum(ds.Traded)
384.0
Alternatively, we can use the FastArray filter()
method to get only the
sizes for the options that were traded. Then we get the sum:
>>> ds.TradeSize.filter(ds.Traded).nansum()
303.0
Both of these methods are much more memory-friendly and computationally efficient than filtering (and making a copy of) the entire Dataset.
Next, we’ll check out Riptable’s datetime objects: Work with Dates and Times.
Questions or comments about this guide? Email RiptableDocumentation@sig.com.