riptable.rt_merge_asof
Time/ordering-based merge implementations.
Functions
|
Perform an as-of merge. This is similar to a left-join except that we match on nearest key rather than equal keys. |
- riptable.rt_merge_asof.merge_asof2(left, right, on=None, left_on=None, right_on=None, by=None, left_by=None, right_by=None, suffixes=None, copy=True, columns_left=None, columns_right=None, *, tolerance=None, allow_exact_matches=True, direction='backward', matched_on=False, **kwargs)
Perform an as-of merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
Both Datasets must be sorted (ascending) by the ‘on’ column. When ‘by’ columns are specified, the ‘on’ column for each Dataset only needs to be sorted (ascending) within each unique ‘key’ of the ‘by’ columns. Sorting the entire
Dataset
by the ‘on’ column also meets this requirement, but someDatasets
may have an ‘on’ column which is already pre-sorted within each ‘by’ key, in which case no additional sorting is required.- For each row in the left Dataset:
A “backward” search selects the last row in the right Dataset whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right Dataset whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the row in the right Dataset whose ‘on’ key is closest in absolute distance to the left’s key.
Optionally match on equivalent keys with ‘by’ before searching with ‘on’.
- Parameters:
left (Dataset) – Left Dataset
right (Dataset) – Right Dataset
on (str) – Column name to join on. Must be found in both the
left
andright
Datasets. This column in both left and right Datasets MUST be ordered. Furthermore this must be a numeric column, such as datetimelike, integer, or float. Eitheron
orleft_on
/right_on
must be specified.left_on (str or list of str, optional) – Column name to join on in
left
Dataset.right_on (str or list of str, optional) – Column name to join on in
right
Dataset.by (str or (str, str) or list of str or list of (str, str), optional) – Column name or list of column names. Match on these columns before performing merge operation.
left_by (str or list of str, optional) – Column names to match on in the left Dataset.
right_by (str or list of str, optional) – Column names to match on in the right Dataset.
suffixes ((str, str), optional, default None) – Suffix to apply to overlapping column names in the left and right side, respectively.
copy (bool, default True) – If False, avoid copying data when possible; this can reduce memory usage but users must be aware that data can be shared between
left
and/orright
and the Dataset returned by this function.columns_left (str or list of str, optional) – Column names to include in the merge from
left
, defaults to None which causes all columns to be included.columns_right (str or list of str, optional) – Column names to include in the merge from
right
, defaults to None which causes all columns to be included.tolerance (integer or float or Timedelta, optional, default None) – Tolerance allowed when performing the ‘asof’ part of the merge; whenever a row from
left
doesn’t have a key inright
within this distance or less, that row will have a null/missing/NA value for any columns from theright
Dataset which appear in the merged result.allow_exact_matches (boolean, default True) –
If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)
If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than)
direction ({'backward', 'forward', or 'nearest'}, default 'backward') – Whether to search for prior, subsequent, or closest matches.
matched_on (bool or str, default False) – If set to True or a string, an additional column is added to the result; for each row, it contains the value from the
on
column inright
that was matched. When True, the column will use the default name ‘matched_on’; specify a string to explicitly name the column.
- Returns:
merged
- Return type:
- Raises:
ValueError – The
on
,left_on
, orright_on
columns are not sorted in ascending order within one or more keys of theby
columns.
See also
riptable.merge_asof
Notes
- TODO: Consider allowing the
tolerance
parameter to also be a 1D array/sequence whose length is the same as the number of groups (or keys?) in the
left
dataset. That allows a per-group tolerance to be specified if needed.
Examples
>>> left = rt.Dataset({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']}) >>> left # a left_val - -- -------- 0 1 a 1 5 b 2 10 c >>> right = rt.Dataset({'a': [1, 2, 3, 6, 7], ... 'right_val': [1, 2, 3, 6, 7]}) >>> right # a right_val - - --------- 0 1 1 1 2 2 2 3 3 3 6 6 4 7 7 >>> rt.merge_asof(left, right, on='a') # a_x left_val a_y right_val - --- -------- --- --------- 0 1 a 1 1 1 5 b 3 3 2 10 c 7 7
>>> rt.merge_asof(left, right, on='a', allow_exact_matches=False) # a_x left_val a_y right_val - --- -------- --- --------- 0 1 a Inv Inv 1 5 b 3 3 2 10 c 7 7
>>> rt.merge_asof(left, right, on='a', direction='forward') # a_x left_val a_y right_val - --- -------- --- --------- 0 1 a 1 1 1 5 b 6 6 2 10 c Inv Inv
Here is a real-world time-series example
>>> quotes # time ticker Bid Ask - -------------------- ------ ---- ---- 0 20191015 09:45:57.09 AAPL 3.40 3.50 1 20191015 11:35:09.76 AAPL 3.45 3.55 2 20191015 12:02:27.11 AAPL 3.50 3.60 3 20191015 12:43:13.73 MSFT 2.85 2.95 4 20191015 14:32:11.18 MSFT 2.90 3.00
>>> trades # time ticker TradePrice TradeSize - -------------------- ------ ---------- --------- 0 20191015 10:03:24.73 AAPL 3.45 1.00 1 20191015 10:41:22.79 MSFT 2.85 1.00 2 20191015 10:41:35.69 MSFT 2.86 1.00 3 20191015 11:04:32.55 AAPL 3.47 1.00 4 20191015 11:44:35.63 MSFT 2.91 1.00 5 20191015 12:26:17.68 AAPL 3.55 1.00 6 20191015 14:24:10.93 MSFT 2.98 1.00 7 20191015 15:45:13.41 AAPL 3.60 7.00 8 20191015 15:50:42.53 AAPL 3.58 1.00 9 20191015 15:53:59.60 AAPL 3.56 5.00
>>> rt.merge_asof(trades, quotes, on='time', by='ticker') # time_x ticker_x TradePrice TradeSize time_y ticker_y Bid Ask - -------------------- --------- ---------- --------- -------------------- -------- ---- ---- 0 20191015 10:03:24.73 AAPL 3.45 1.00 20191015 09:45:57.09 AAPL 3.40 3.50 1 20191015 10:41:22.79 MSFT 2.85 1.00 Inv Filtered nan nan 2 20191015 10:41:35.69 MSFT 2.86 1.00 Inv Filtered nan nan 3 20191015 11:04:32.55 AAPL 3.47 1.00 20191015 09:45:57.09 AAPL 3.40 3.50 4 20191015 11:44:35.63 MSFT 2.91 1.00 Inv Filtered nan nan 5 20191015 12:26:17.68 AAPL 3.55 1.00 20191015 12:02:27.11 AAPL 3.50 3.60 6 20191015 14:24:10.93 MSFT 2.98 1.00 20191015 12:43:13.73 MSFT 2.85 2.95 7 20191015 15:45:13.41 AAPL 3.60 7.00 20191015 12:02:27.11 AAPL 3.50 3.60 8 20191015 15:50:42.53 AAPL 3.58 1.00 20191015 12:02:27.11 AAPL 3.50 3.60 9 20191015 15:53:59.60 AAPL 3.56 5.00 20191015 12:02:27.11 AAPL 3.50 3.60
only merge with the forward quotes
>>> rt.merge_asof(trades, quotes, on='time', by='ticker', direction='forward') # time_x ticker_x TradePrice TradeSize time_y ticker_y Bid Ask - -------------------- -------- ---------- --------- -------------------- -------- ---- ---- 0 20191015 10:03:24.73 AAPL 3.45 1.00 20191015 11:35:09.76 AAPL 3.45 3.55 1 20191015 10:41:22.79 MSFT 2.85 1.00 20191015 12:43:13.73 MSFT 2.85 2.95 2 20191015 10:41:35.69 MSFT 2.86 1.00 20191015 12:43:13.73 MSFT 2.85 2.95 3 20191015 11:04:32.55 AAPL 3.47 1.00 20191015 11:35:09.76 AAPL 3.45 3.55 4 20191015 11:44:35.63 MSFT 2.91 1.00 20191015 12:43:13.73 MSFT 2.85 2.95 6 20191015 14:24:10.93 MSFT 2.98 1.00 20191015 14:32:11.18 MSFT 2.90 3.00 7 20191015 15:45:13.41 AAPL 3.60 7.00 Inv Filtered nan nan 8 20191015 15:50:42.53 AAPL 3.58 1.00 Inv Filtered nan nan 9 20191015 15:53:59.60 AAPL 3.56 5.00 Inv Filtered nan nan 5 20191015 12:26:17.68 AAPL 3.55 1.00 Inv Filtered nan nan